Beir (benchmark suite)

`"beir"`

BEIR is a suite of benchmarks for testing the zero-shot transfer of information retrieval models.


\cite{Thakur2021Beir}

Bibtex:

@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

`"beir/arguana"`

A version of the ArguAna Counterargs dataset, for argument retrieval.

queries

1.4K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/arguana")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/arguana queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/arguana')
index_ref = pt.IndexRef.of('./indices/beir_arguana') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.arguana.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

8.7K docs

Language: en

Document type:

BeirTitleDoc: (namedtuple)

doc_id: str
text: str
title: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/arguana")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>

You can find more details about the Python API here.

CLI

ir_datasets export beir/arguana docs



[doc_id]    [text]    [title]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/arguana')
# Index beir/arguana
indexer = pt.IterDictIndexer('./indices/beir_arguana', meta={"docno": 47})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.
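The `meta={"docno": 47}` argument in the indexing example appears to match the `max_len` of `doc_id` reported in this dataset's Metadata section, so that the longest document ID is not truncated. The following is a minimal sketch of that relationship, not part of PyTerrier or ir_datasets: the helper `docno_meta` is hypothetical, and the default length of 20 characters is an assumption about PyTerrier's indexer that should be checked against the installed version.

```python
from typing import Dict, Optional

# Assumption: PyTerrier stores docno metadata with a default width of 20 characters.
DEFAULT_DOCNO_LEN = 20


def docno_meta(max_doc_id_len: int,
               default_len: int = DEFAULT_DOCNO_LEN) -> Optional[Dict[str, int]]:
    """Return a meta dict wide enough for the longest doc_id.

    Returns None when the assumed default width already fits every doc_id.
    """
    if max_doc_id_len <= default_len:
        return None
    return {"docno": max_doc_id_len}


# max_len values as reported in the Metadata sections of this page.
arguana_meta = docno_meta(47)
android_meta = docno_meta(5)
```

Under these assumptions, a non-`None` result would be passed as the `meta` argument of the indexer, while `None` means the argument can be omitted.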

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.arguana')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

1.4K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/arguana")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/arguana qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.
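Once exported, the tab-separated qrels can be loaded with the standard library alone. The sketch below assumes the column order shown above (query_id, doc_id, relevance, iteration) and no header row; the function name `read_qrels_tsv` and the sample rows are hypothetical and only illustrate the shape of the data.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO

# Hypothetical sample rows; real IDs and values come from the export command.
sample_tsv = "q1\td1\t1\t0\nq1\td2\t1\t0\nq2\td3\t1\t0\n"


def read_qrels_tsv(handle: TextIO) -> Dict[str, Dict[str, int]]:
    """Group doc_id -> relevance by query_id from tab-separated qrels rows."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for query_id, doc_id, relevance, _iteration in reader:
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


qrels = read_qrels_tsv(io.StringIO(sample_tsv))
```

In practice the `io.StringIO` object would be replaced by an open file handle on the exported file.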

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/arguana')
index_ref = pt.IndexRef.of('./indices/beir_arguana') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.arguana.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Wachsmuth2018Arguana,Thakur2021Beir}

Bibtex:

@inproceedings{Wachsmuth2018Arguana,
  author = "Wachsmuth, Henning and Syed, Shahbaz and Stein, Benno",
  title = "Retrieval of the Best Counterargument without Prior Topic Knowledge",
  booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  year = "2018",
  publisher = "Association for Computational Linguistics",
  location = "Melbourne, Australia",
  pages = "241--251",
  url = "http://aclweb.org/anthology/P18-1023"
}

@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

Metadata

{
  "docs": {
    "count": 8674,
    "fields": {
      "doc_id": {
        "max_len": 47,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1406
  },
  "qrels": {
    "count": 1406,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 1406
        }
      }
    }
  }
}
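The metadata record can be used to derive simple summary statistics. The sketch below copies the JSON shown above into a string and computes the average number of qrels per query and the share of each relevance level; the variable names are illustrative only.

```python
import json

# The metadata record for beir/arguana, copied from the block above.
metadata = json.loads("""
{
  "docs": {"count": 8674, "fields": {"doc_id": {"max_len": 47, "common_prefix": ""}}},
  "queries": {"count": 1406},
  "qrels": {"count": 1406, "fields": {"relevance": {"counts_by_value": {"1": 1406}}}}
}
""")

qrels_count = metadata["qrels"]["count"]

# Average number of relevance judgments per query.
qrels_per_query = qrels_count / metadata["queries"]["count"]

# Fraction of all judgments that fall on each relevance level.
relevance_counts = metadata["qrels"]["fields"]["relevance"]["counts_by_value"]
relevance_share = {level: count / qrels_count for level, count in relevance_counts.items()}
```

For this dataset the counts give an average of one judgment per query, all at relevance level 1.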

`"beir/climate-fever"`

A version of the CLIMATE-FEVER dataset, for fact verification on claims about climate.

queries

1.5K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/climate-fever")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/climate-fever queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/climate-fever')
index_ref = pt.IndexRef.of('./indices/beir_climate-fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.climate-fever.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

5.4M docs

Language: en

Document type:

BeirTitleDoc: (namedtuple)

doc_id: str
text: str
title: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/climate-fever")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>

You can find more details about the Python API here.
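With roughly 5.4M documents, a full pass over this corpus takes a while, so it can help to inspect only the first few records. The following is a minimal sketch using `itertools.islice`, which works on any iterator. The `fake_docs_iter` generator and the `DocSketch` type are hypothetical stand-ins for `dataset.docs_iter()` and its records, used here so the snippet runs without downloading the corpus.

```python
from itertools import islice
from typing import Iterator, NamedTuple


class DocSketch(NamedTuple):
    # Hypothetical stand-in with the BeirTitleDoc fields listed above.
    doc_id: str
    text: str
    title: str


def fake_docs_iter() -> Iterator[DocSketch]:
    """Stand-in for dataset.docs_iter(); lazily yields made-up records."""
    n = 0
    while True:
        yield DocSketch(doc_id=f"doc-{n}", text=f"text {n}", title=f"title {n}")
        n += 1


# Take only the first three records; the rest of the iterator is never consumed.
first_docs = list(islice(fake_docs_iter(), 3))
```

The same `islice` call should work unchanged when `fake_docs_iter()` is swapped for the real document iterator.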

CLI

ir_datasets export beir/climate-fever docs



[doc_id]    [text]    [title]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/climate-fever')
# Index beir/climate-fever
indexer = pt.IterDictIndexer('./indices/beir_climate-fever', meta={"docno": 221})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.climate-fever')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

4.7K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/climate-fever")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/climate-fever qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/climate-fever')
index_ref = pt.IndexRef.of('./indices/beir_climate-fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.climate-fever.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Diggelmann2020CLIMATEFEVERAD,Thakur2021Beir}

Bibtex:

@article{Diggelmann2020CLIMATEFEVERAD,
  title={CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims},
  author={T. Diggelmann and Jordan L. Boyd-Graber and Jannis Bulian and Massimiliano Ciaramita and Markus Leippold},
  journal={ArXiv},
  year={2020},
  volume={abs/2012.00614}
}

@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

Metadata

{
  "docs": {
    "count": 5416593,
    "fields": {
      "doc_id": {
        "max_len": 221,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1535
  },
  "qrels": {
    "count": 4681,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 4681
        }
      }
    }
  }
}

`"beir/cqadupstack/android"`

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the android StackExchange subforum.

queries

699 queries

Language: en

Query type:

BeirCqaQuery: (namedtuple)

query_id: str
text: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/android")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>

You can find more details about the Python API here.
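Because each query carries a list of tags, queries can be filtered by tag with a simple membership test. The sketch below uses a hypothetical stand-in type, `CqaQuerySketch`, with the same three fields as `BeirCqaQuery`; the sample queries and tag names are made up for illustration.

```python
from typing import Iterable, List, NamedTuple


class CqaQuerySketch(NamedTuple):
    # Hypothetical stand-in mirroring the BeirCqaQuery fields listed above.
    query_id: str
    text: str
    tags: List[str]


def queries_with_tag(queries: Iterable[CqaQuerySketch], tag: str) -> List[CqaQuerySketch]:
    """Keep only the queries whose tag list contains the given tag."""
    return [query for query in queries if tag in query.tags]


# Made-up sample queries; real ones come from dataset.queries_iter().
sample_queries = [
    CqaQuerySketch("1", "How do I back up my contacts?", ["backup", "contacts"]),
    CqaQuerySketch("2", "Why does my battery drain overnight?", ["battery"]),
    CqaQuerySketch("3", "Can I restore a backup to a new phone?", ["backup"]),
]

backup_queries = queries_with_tag(sample_queries, "backup")
```

The same function should accept the records yielded by `dataset.queries_iter()`, since it only relies on the `tags` field.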

CLI

ir_datasets export beir/cqadupstack/android queries



[query_id]    [text]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/android')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_android') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.cqadupstack.android.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

23K docs

Language: en

Document type:

BeirCqaDoc: (namedtuple)

doc_id: str
text: str
title: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/android")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/android docs



[doc_id]    [text]    [title]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/android')
# Index beir/cqadupstack/android
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_android')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.cqadupstack.android')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

1.7K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/android")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/android qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/android')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_android') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.cqadupstack.android.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack,
  title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research},
  author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin},
  journal={Proceedings of the 20th Australasian Document Computing Symposium},
  year={2015}
}

@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

Metadata

{
  "docs": {
    "count": 22998,
    "fields": {
      "doc_id": {
        "max_len": 5,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 699
  },
  "qrels": {
    "count": 1696,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 1696
        }
      }
    }
  }
}

`"beir/cqadupstack/english"`

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the english StackExchange subforum.

queries

1.6K queries

Language: en

Query type:

BeirCqaQuery: (namedtuple)

query_id: str
text: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/english")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/english queries



[query_id]    [text]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/english')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_english') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.cqadupstack.english.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

40K docs

Language: en

Document type:

BeirCqaDoc: (namedtuple)

doc_id: str
text: str
title: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/english")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/english docs



[doc_id]    [text]    [title]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/english')
# Index beir/cqadupstack/english
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_english')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.cqadupstack.english')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

3.8K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/english")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/english qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/english')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_english') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.cqadupstack.english.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack,
  title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research},
  author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin},
  journal={Proceedings of the 20th Australasian Document Computing Symposium},
  year={2015}
}

@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

Metadata

{
  "docs": {
    "count": 40221,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1570
  },
  "qrels": {
    "count": 3765,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 3765
        }
      }
    }
  }
}

`"beir/cqadupstack/gaming"`

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the gaming StackExchange subforum.

queries

1.6K queries

Language: en

Query type:

BeirCqaQuery: (namedtuple)

query_id: str
text: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/gaming")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/gaming queries



[query_id]    [text]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/gaming')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_gaming') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.cqadupstack.gaming.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

45K docs

Language: en

Document type:

BeirCqaDoc: (namedtuple)

doc_id: str
text: str
title: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/gaming")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/gaming docs



[doc_id]    [text]    [title]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/gaming')
# Index beir/cqadupstack/gaming
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_gaming')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.cqadupstack.gaming')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

2.3K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/gaming")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/gaming qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/gaming')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_gaming') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.cqadupstack.gaming.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack,
  title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research},
  author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin},
  journal={Proceedings of the 20th Australasian Document Computing Symposium},
  year={2015}
}

@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

Metadata

{
  "docs": {
    "count": 45301,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1595
  },
  "qrels": {
    "count": 2263,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 2263
        }
      }
    }
  }
}

`"beir/cqadupstack/gis"`

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the gis StackExchange subforum.

queries

885 queries

Language: en

Query type:

BeirCqaQuery: (namedtuple)

query_id: str
text: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/gis")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/gis queries



[query_id]    [text]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/gis')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_gis') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.cqadupstack.gis.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

38K docs

Language: en

Document type:

BeirCqaDoc: (namedtuple)

doc_id: str
text: str
title: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/gis")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/gis docs



[doc_id]    [text]    [title]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/gis')
# Index beir/cqadupstack/gis
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_gis')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.cqadupstack.gis')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

1.1K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/gis")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/gis qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/gis')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_gis') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.cqadupstack.gis.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack,
  title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research},
  author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin},
  journal={Proceedings of the 20th Australasian Document Computing Symposium},
  year={2015}
}

@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

Metadata

{
  "docs": {
    "count": 37637,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 885
  },
  "qrels": {
    "count": 1114,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 1114
        }
      }
    }
  }
}
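The CQADupStack subsets are documented separately, so it can be useful to compare or aggregate their sizes. The sketch below hard-codes the counts from the Metadata blocks of the four subsets documented up to this point (android, english, gaming, gis); it does not cover the remaining subsets, and the variable names are illustrative only.

```python
# Counts copied from the Metadata blocks of the four subsets above.
subsets = {
    "android": {"docs": 22998, "queries": 699, "qrels": 1696},
    "english": {"docs": 40221, "queries": 1570, "qrels": 3765},
    "gaming": {"docs": 45301, "queries": 1595, "qrels": 2263},
    "gis": {"docs": 37637, "queries": 885, "qrels": 1114},
}

# Sum each count across the four subsets.
totals = {
    key: sum(counts[key] for counts in subsets.values())
    for key in ("docs", "queries", "qrels")
}

# Average number of relevance judgments per query, per subset.
qrels_per_query = {
    name: round(counts["qrels"] / counts["queries"], 2)
    for name, counts in subsets.items()
}
```

The per-subset ratios show that the number of judged duplicates per query varies noticeably between subforums.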

`"beir/cqadupstack/mathematica"`

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the mathematica StackExchange subforum.

queries

804 queries

Language: en

Query type:

BeirCqaQuery: (namedtuple)

query_id: str
text: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/mathematica")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/mathematica queries



[query_id]    [text]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/mathematica')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_mathematica') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.cqadupstack.mathematica.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

17K docs

Language: en

Document type:

BeirCqaDoc: (namedtuple)

doc_id: str
text: str
title: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/mathematica")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/mathematica docs



[doc_id]    [text]    [title]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/mathematica')
# Index beir/cqadupstack/mathematica
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_mathematica')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.cqadupstack.mathematica')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

1.4K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/mathematica")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/mathematica qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.
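
Once the qrels have been exported, the file can be read back with the standard library. This is a sketch under stated assumptions: it expects one judgment per line, tab-separated, in the column order shown above, and no header row. The read_qrels_tsv helper and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO

# Invented rows standing in for an exported file; real IDs will differ.
sample_tsv = "q1\td10\t1\t0\nq1\td11\t1\t0\nq2\td20\t1\t0\n"


def read_qrels_tsv(handle: TextIO) -> Dict[str, Dict[str, int]]:
    """Return {query_id: {doc_id: relevance}} from tab-separated qrels rows."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for query_id, doc_id, relevance, _iteration in csv.reader(handle, delimiter="\t"):
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


qrels = read_qrels_tsv(io.StringIO(sample_tsv))
print(qrels)
```

To read a real export, open the file the CLI output was redirected to and pass the file handle instead of the io.StringIO object.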

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/mathematica')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_mathematica') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.cqadupstack.mathematica.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 16705,
    "fields": {
      "doc_id": {
        "max_len": 5,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 804
  },
  "qrels": {
    "count": 1358,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 1358
        }
      }
    }
  }
}
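
The metadata block can be turned into a few summary figures. The sketch below copies the counts shown above into a JSON string (omitting the doc_id field details) and derives the average number of judgments per query, the number of documents per query, and the share of judgments at relevance level 1. The variable names are illustrative only.

```python
import json

# Counts copied from the metadata block for beir/cqadupstack/mathematica.
metadata_json = """
{
  "docs": {"count": 16705},
  "queries": {"count": 804},
  "qrels": {
    "count": 1358,
    "fields": {"relevance": {"counts_by_value": {"1": 1358}}}
  }
}
"""

metadata = json.loads(metadata_json)
n_queries = metadata["queries"]["count"]
n_qrels = metadata["qrels"]["count"]
counts_by_value = metadata["qrels"]["fields"]["relevance"]["counts_by_value"]

qrels_per_query = n_qrels / n_queries
docs_per_query = metadata["docs"]["count"] / n_queries
share_rel_1 = counts_by_value["1"] / n_qrels

print(f"{qrels_per_query:.2f} judgments per query")
print(f"{docs_per_query:.1f} documents per query")
print(f"{share_rel_1:.0%} of judgments have relevance 1")
```

Every judgment in this subset has relevance 1, so the qrels are effectively binary: a listed document is a known duplicate of the query.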

`"beir/cqadupstack/physics"`

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the physics StackExchange subforum.

queries

1.0K queries

Language: en

Query type:

BeirCqaQuery: (namedtuple)

query_id: str
text: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/physics")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/physics queries



[query_id]    [text]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/physics')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_physics') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.cqadupstack.physics.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

38K docs

Language: en

Document type:

BeirCqaDoc: (namedtuple)

doc_id: str
text: str
title: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/physics")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/physics docs



[doc_id]    [text]    [title]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/physics')
# Index beir/cqadupstack/physics
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_physics')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.
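
The indexing call above selects the text and title fields. As a rough picture of the records an iterator-based indexer consumes, the sketch below converts namedtuple documents into dictionaries keyed by docno. The docno key follows PyTerrier's usual convention; the CqaDoc type, the to_index_records helper, and the sample record are hypothetical, and this is not the code that get_corpus_iter() itself runs.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple, Sequence


class CqaDoc(NamedTuple):
    """Stand-in with the same fields as BeirCqaDoc (assumption: illustration only)."""
    doc_id: str
    text: str
    title: str
    tags: List[str]


def to_index_records(
    docs: Iterable[CqaDoc], fields: Sequence[str] = ("text", "title")
) -> Iterator[Dict[str, str]]:
    """Yield one dict per document holding 'docno' plus the selected fields."""
    for doc in docs:
        record = {"docno": doc.doc_id}
        for field in fields:
            record[field] = getattr(doc, field)
        yield record


# Invented sample document; real content will differ.
sample_docs = [
    CqaDoc("d1", "Why does light bend in water?", "Refraction basics", ["optics"]),
]
records = list(to_index_records(sample_docs))
print(records)
```

Fields that are not selected, such as tags here, are simply left out of the record.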

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.cqadupstack.physics')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

1.9K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/physics")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
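
The qrels iterator is enough to score a ranking by hand. The sketch below groups judgments by query and computes recall at a cutoff k for one ranked list. Qrel is a stand-in for the TrecQrel namedtuple, and relevant_docs, recall_at_k, the sample judgments, and the ranking are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set


class Qrel(NamedTuple):
    """Stand-in with the same fields as TrecQrel (assumption: illustration only)."""
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def relevant_docs(qrels: Iterable[Qrel], min_relevance: int = 1) -> Dict[str, Set[str]]:
    """Collect, per query, the doc IDs judged at or above min_relevance."""
    relevant: Dict[str, Set[str]] = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            relevant[qrel.query_id].add(qrel.doc_id)
    return dict(relevant)


def recall_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranking[:k] if doc_id in relevant)
    return hits / len(relevant)


# Invented judgments and ranking; real IDs will differ.
sample_qrels = [
    Qrel("q1", "d1", 1, "0"),
    Qrel("q1", "d2", 1, "0"),
    Qrel("q2", "d3", 1, "0"),
]
relevant = relevant_docs(sample_qrels)
ranking_q1 = ["d9", "d1", "d7", "d2"]
recall_2 = recall_at_k(ranking_q1, relevant["q1"], k=2)
recall_4 = recall_at_k(ranking_q1, relevant["q1"], k=4)
print(recall_2, recall_4)
```

For real experiments an evaluation library is the safer choice; this only illustrates how the qrels fields are used.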

CLI

ir_datasets export beir/cqadupstack/physics qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/physics')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_physics') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.cqadupstack.physics.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 38316,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1039
  },
  "qrels": {
    "count": 1933,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 1933
        }
      }
    }
  }
}

`"beir/cqadupstack/programmers"`

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the programmers StackExchange subforum.

queries

876 queries

Language: en

Query type:

BeirCqaQuery: (namedtuple)

query_id: str
text: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/programmers")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/programmers queries



[query_id]    [text]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/programmers')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_programmers') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.cqadupstack.programmers.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

32K docs

Language: en

Document type:

BeirCqaDoc: (namedtuple)

doc_id: str
text: str
title: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/programmers")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/programmers docs



[doc_id]    [text]    [title]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/programmers')
# Index beir/cqadupstack/programmers
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_programmers')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.cqadupstack.programmers')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

1.7K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/programmers")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/programmers qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/programmers')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_programmers') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.cqadupstack.programmers.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 32176,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 876
  },
  "qrels": {
    "count": 1675,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 1675
        }
      }
    }
  }
}

`"beir/cqadupstack/stats"`

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the stats StackExchange subforum.

queries

652 queries

Language: en

Query type:

BeirCqaQuery: (namedtuple)

query_id: str
text: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/stats")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/stats queries



[query_id]    [text]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/stats')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_stats') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.cqadupstack.stats.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

42K docs

Language: en

Document type:

BeirCqaDoc: (namedtuple)

doc_id: str
text: str
title: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/stats")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>

You can find more details about the Python API here.
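
The tags field makes it easy to see which topics dominate a subset. As a minimal sketch, the code below counts tag frequencies over an iterable of documents. CqaDoc is a stand-in for the BeirCqaDoc namedtuple, tag_counts is a hypothetical helper, and the three sample documents and their tags are invented.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class CqaDoc(NamedTuple):
    """Stand-in with the same fields as BeirCqaDoc (assumption: illustration only)."""
    doc_id: str
    text: str
    title: str
    tags: List[str]


def tag_counts(docs: Iterable[CqaDoc]) -> Counter:
    """Count how many documents carry each tag."""
    counts: Counter = Counter()
    for doc in docs:
        counts.update(doc.tags)
    return counts


# Invented sample documents; real content and tags will differ.
sample_docs = [
    CqaDoc("d1", "body one", "Interpreting coefficients", ["regression"]),
    CqaDoc("d2", "body two", "Choosing a prior", ["bayesian"]),
    CqaDoc("d3", "body three", "Residual plots", ["regression", "diagnostics"]),
]
counts = tag_counts(sample_docs)
print(counts.most_common(1))
```

The same function should accept dataset.docs_iter() directly, since it reads only the tags attribute of each document.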

CLI

ir_datasets export beir/cqadupstack/stats docs



[doc_id]    [text]    [title]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/stats')
# Index beir/cqadupstack/stats
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_stats')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.cqadupstack.stats')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

913 qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/stats")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/stats qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/stats')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_stats') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.cqadupstack.stats.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 42269,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 652
  },
  "qrels": {
    "count": 913,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 913
        }
      }
    }
  }
}

`"beir/cqadupstack/tex"`

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the tex StackExchange subforum.

queries

2.9K queries

Language: en

Query type:

BeirCqaQuery: (namedtuple)

query_id: str
text: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/tex")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/tex queries



[query_id]    [text]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/tex')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_tex') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.
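
The 'text' argument to get_topics names the query field to retrieve with, which matters here because queries have both text and tags. The sketch below shows the idea in plain Python: one row per query with a qid and a query value. The qid and query names follow PyTerrier's topics convention; CqaQuery, to_topic_rows, and the sample queries are hypothetical, and PyTerrier itself returns a pandas DataFrame rather than a list of dicts.

```python
from typing import Dict, Iterable, List, NamedTuple


class CqaQuery(NamedTuple):
    """Stand-in with the same fields as BeirCqaQuery (assumption: illustration only)."""
    query_id: str
    text: str
    tags: List[str]


def to_topic_rows(queries: Iterable[CqaQuery], field: str = "text") -> List[Dict[str, str]]:
    """Build qid/query rows, taking the query string from the chosen field."""
    return [{"qid": q.query_id, "query": getattr(q, field)} for q in queries]


# Invented sample queries; real IDs and texts will differ.
sample_queries = [
    CqaQuery("7", "Center a table on the page", ["tables"]),
    CqaQuery("8", "Bibliography not showing", ["bibtex", "biblatex"]),
]
rows = to_topic_rows(sample_queries)
print(rows)
```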

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.cqadupstack.tex.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

68K docs

Language: en

Document type:

BeirCqaDoc: (namedtuple)

doc_id: str
text: str
title: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/tex")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/tex docs



[doc_id]    [text]    [title]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/tex')
# Index beir/cqadupstack/tex
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_tex')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.cqadupstack.tex')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

5.2K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/tex")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/tex qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/tex')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_tex') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.cqadupstack.tex.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 68184,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 2906
  },
  "qrels": {
    "count": 5154,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 5154
        }
      }
    }
  }
}

`"beir/cqadupstack/unix"`

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the unix StackExchange subforum.

queries

1.1K queries

Language: en

Query type:

BeirCqaQuery: (namedtuple)

query_id: str
text: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/unix")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/unix queries



[query_id]    [text]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/unix')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_unix') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.cqadupstack.unix.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

47K docs

Language: en

Document type:

BeirCqaDoc: (namedtuple)

doc_id: str
text: str
title: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/unix")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/unix docs



[doc_id]    [text]    [title]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/unix')
# Index beir/cqadupstack/unix
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_unix')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.cqadupstack.unix')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

1.7K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/unix")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/unix qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/unix')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_unix') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.cqadupstack.unix.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 47382,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1072
  },
  "qrels": {
    "count": 1693,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 1693
        }
      }
    }
  }
}
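
The metadata blocks for the mathematica, physics, programmers, stats, tex, and unix subsets can be compared side by side. The sketch below copies their query and qrel counts into a dictionary and ranks the subsets by average judgments per query. The dictionary layout and variable names are illustrative only.

```python
# Query and qrel counts copied from the metadata blocks of each subset.
subsets = {
    "mathematica": {"queries": 804, "qrels": 1358},
    "physics": {"queries": 1039, "qrels": 1933},
    "programmers": {"queries": 876, "qrels": 1675},
    "stats": {"queries": 652, "qrels": 913},
    "tex": {"queries": 2906, "qrels": 5154},
    "unix": {"queries": 1072, "qrels": 1693},
}

# Average number of judged duplicates per query for each subset.
density = {name: c["qrels"] / c["queries"] for name, c in subsets.items()}

for name, value in sorted(density.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:12s} {value:.2f}")

densest = max(density, key=density.get)
sparsest = min(density, key=density.get)
```

All six subsets average fewer than two judged duplicates per query, so rank-sensitive measures depend heavily on where those one or two documents land.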

`"beir/cqadupstack/webmasters"`

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the webmasters StackExchange subforum.

queries

506 queries

Language: en

Query type:

BeirCqaQuery: (namedtuple)

query_id: str
text: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/webmasters")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/webmasters queries



[query_id]    [text]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/webmasters')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_webmasters') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.cqadupstack.webmasters.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

17K docs

Language: en

Document type:

BeirCqaDoc: (namedtuple)

doc_id: str
text: str
title: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/webmasters")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/webmasters docs



[doc_id]    [text]    [title]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/webmasters')
# Index beir/cqadupstack/webmasters
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_webmasters')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.cqadupstack.webmasters')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

1.4K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/webmasters")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/webmasters qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.
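
The exported lines can be read back into records of the same shape as `TrecQrel`. This is a sketch only, assuming each exported line holds four tab-separated fields in the column order shown above. The local `TrecQrel` class, the `parse_qrel_line` helper, and the sample lines are hypothetical.

```python
from typing import NamedTuple


class TrecQrel(NamedTuple):
    # Local stand-in that mirrors the fields documented above.
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def parse_qrel_line(line: str) -> TrecQrel:
    """Parse one tab-separated line into a TrecQrel record."""
    query_id, doc_id, relevance, iteration = line.rstrip("\n").split("\t")
    return TrecQrel(query_id, doc_id, int(relevance), iteration)


# Hypothetical lines in the assumed export layout (not real dataset rows).
sample_lines = ["q1\td7\t1\t0\n", "q1\td9\t1\t0\n", "q2\td3\t1\t0\n"]
qrels = [parse_qrel_line(line) for line in sample_lines]
print(qrels[0])
```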

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/webmasters')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_webmasters') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.cqadupstack.webmasters.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 17405,
    "fields": {
      "doc_id": {
        "max_len": 5,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 506
  },
  "qrels": {
    "count": 1395,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 1395
        }
      }
    }
  }
}
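
The metadata can be used for quick sanity checks without downloading the collection. As a sketch, the snippet below computes the average number of judged documents per query from an abridged copy of the block above (the `doc_id` field details are omitted). The `qrels_per_query` helper is hypothetical.

```python
import json

# Abridged copy of the metadata for beir/cqadupstack/webmasters shown above.
metadata = json.loads("""
{
  "docs": {"count": 17405},
  "queries": {"count": 506},
  "qrels": {
    "count": 1395,
    "fields": {"relevance": {"counts_by_value": {"1": 1395}}}
  }
}
""")


def qrels_per_query(metadata: dict) -> float:
    """Average number of relevance judgments per query."""
    return metadata["qrels"]["count"] / metadata["queries"]["count"]


avg = qrels_per_query(metadata)
print(f"{avg:.2f} judged documents per query on average")
```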

`"beir/cqadupstack/wordpress"`

A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the wordpress StackExchange subforum.

queries

541 queries

Language: en

Query type:

BeirCqaQuery: (namedtuple)

query_id: str
text: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/wordpress")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/wordpress queries



[query_id]    [text]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/wordpress')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_wordpress') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.cqadupstack.wordpress.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

49K docs

Language: en

Document type:

BeirCqaDoc: (namedtuple)

doc_id: str
text: str
title: str
tags: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/wordpress")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/wordpress docs



[doc_id]    [text]    [title]    [tags]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/wordpress')
# Index beir/cqadupstack/wordpress
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_wordpress')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.cqadupstack.wordpress')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

744 qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/wordpress")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/cqadupstack/wordpress qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/wordpress')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_wordpress') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.cqadupstack.wordpress.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Hoogeveen2015CqaDupStack,Thakur2021Beir}

Bibtex:

@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 48605,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 541
  },
  "qrels": {
    "count": 744,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 744
        }
      }
    }
  }
}

`"beir/dbpedia-entity"`

A version of the DBPedia-Entity-v2 dataset for entity retrieval.

queries

467 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/dbpedia-entity queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity')
index_ref = pt.IndexRef.of('./indices/beir_dbpedia-entity') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.dbpedia-entity.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

4.6M docs

Language: en

Document type:

BeirTitleUrlDoc: (namedtuple)

doc_id: str
text: str
title: str
url: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url>

You can find more details about the Python API here.

CLI

ir_datasets export beir/dbpedia-entity docs



[doc_id]    [text]    [title]    [url]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity')
# Index beir/dbpedia-entity
indexer = pt.IterDictIndexer('./indices/beir_dbpedia-entity', meta={"docno": 200})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.dbpedia-entity')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

Citation

ir_datasets.bib:

\cite{Hasibi2017DBpediaEntityVA,Thakur2021Beir}

Bibtex:

@article{Hasibi2017DBpediaEntityVA, title={DBpedia-Entity v2: A Test Collection for Entity Search}, author={Faegheh Hasibi and Fedor Nikolaev and Chenyan Xiong and K. Balog and S. E. Bratsberg and Alexander Kotov and J. Callan}, journal={Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval}, year={2017} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 4635922,
    "fields": {
      "doc_id": {
        "max_len": 200,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 467
  }
}
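
The `max_len` of 200 reported for `doc_id` matches the `meta={"docno": 200}` argument passed to `pt.IterDictIndexer` in the indexing example above, which suggests the docno length is sized to fit the longest document ID. The sketch below derives that argument from the metadata instead of hard-coding it. The `docno_meta` helper is hypothetical, and the link between the two values is an inference from this page rather than a documented guarantee.

```python
import json

# Copy of the metadata for beir/dbpedia-entity shown above.
metadata = json.loads("""
{
  "docs": {
    "count": 4635922,
    "fields": {"doc_id": {"max_len": 200, "common_prefix": ""}}
  },
  "queries": {"count": 467}
}
""")


def docno_meta(metadata: dict) -> dict:
    """Build a meta mapping whose docno length fits the longest doc_id."""
    return {"docno": metadata["docs"]["fields"]["doc_id"]["max_len"]}


meta = docno_meta(metadata)
print(meta)  # {'docno': 200}
```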

`"beir/dbpedia-entity/dev"`

A random sample of 67 queries from the official test set, used as a dev set.
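
To illustrate what such a split looks like, the sketch below draws a reproducible random sample of 67 IDs from 467 and keeps the remaining 400 as the test set. It is not the procedure used to build this dataset: the seed, the placeholder query IDs, and the `split_dev_test` helper are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_dev_test(query_ids: Sequence[str], dev_size: int, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Draw dev_size IDs at random; everything else stays in the test set."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    dev = set(rng.sample(list(query_ids), dev_size))
    test = [qid for qid in query_ids if qid not in dev]
    return sorted(dev), test


# Placeholder IDs standing in for the 467 queries of beir/dbpedia-entity.
query_ids = [f"q{i}" for i in range(467)]
dev_ids, test_ids = split_dev_test(query_ids, dev_size=67)
print(len(dev_ids), len(test_ids))  # 67 400
```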

queries

67 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/dbpedia-entity/dev queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity/dev')
index_ref = pt.IndexRef.of('./indices/beir_dbpedia-entity') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.dbpedia-entity.dev.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

4.6M docs

Inherits docs from beir/dbpedia-entity

Language: en

Document type:

BeirTitleUrlDoc: (namedtuple)

doc_id: str
text: str
title: str
url: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url>

You can find more details about the Python API here.

CLI

ir_datasets export beir/dbpedia-entity/dev docs



[doc_id]    [text]    [title]    [url]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity/dev')
# Index beir/dbpedia-entity
indexer = pt.IterDictIndexer('./indices/beir_dbpedia-entity', meta={"docno": 200})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.dbpedia-entity.dev')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

5.7K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/dbpedia-entity/dev qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity/dev')
index_ref = pt.IndexRef.of('./indices/beir_dbpedia-entity') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.dbpedia-entity.dev.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Hasibi2017DBpediaEntityVA,Thakur2021Beir}

Bibtex:

@article{Hasibi2017DBpediaEntityVA, title={DBpedia-Entity v2: A Test Collection for Entity Search}, author={Faegheh Hasibi and Fedor Nikolaev and Chenyan Xiong and K. Balog and S. E. Bratsberg and Alexander Kotov and J. Callan}, journal={Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval}, year={2017} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 4635922,
    "fields": {
      "doc_id": {
        "max_len": 200,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 67
  },
  "qrels": {
    "count": 5673,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 4268,
          "1": 1024,
          "2": 381
        }
      }
    }
  }
}
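
The `counts_by_value` entry gives the number of judgments at each relevance level. As a sketch, the share of each level can be computed directly from those counts. The `relevance_shares` helper is hypothetical.

```python
from typing import Dict

# counts_by_value for beir/dbpedia-entity/dev, copied from the metadata above.
counts_by_value = {"0": 4268, "1": 1024, "2": 381}


def relevance_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Return the percentage of judgments at each relevance level."""
    total = sum(counts.values())
    return {level: 100.0 * count / total for level, count in counts.items()}


total = sum(counts_by_value.values())
shares = relevance_shares(counts_by_value)
for level in sorted(shares, key=int):
    print(f"{level}\t{counts_by_value[level]}\t{shares[level]:.1f}%")
```

The counts sum to 5673, which agrees with the `qrels.count` value in the same block.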

`"beir/dbpedia-entity/test"`

The official test set, excluding the 67 queries used as a dev set.

queries

400 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/dbpedia-entity/test queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity/test')
index_ref = pt.IndexRef.of('./indices/beir_dbpedia-entity') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.dbpedia-entity.test.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

4.6M docs

Inherits docs from beir/dbpedia-entity

Language: en

Document type:

BeirTitleUrlDoc: (namedtuple)

doc_id: str
text: str
title: str
url: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url>

You can find more details about the Python API here.

CLI

ir_datasets export beir/dbpedia-entity/test docs



[doc_id]    [text]    [title]    [url]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity/test')
# Index beir/dbpedia-entity
indexer = pt.IterDictIndexer('./indices/beir_dbpedia-entity', meta={"docno": 200})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.dbpedia-entity.test')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

44K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
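
Relevance is an integer, and the metadata for this subset lists the values 0, 1, and 2. Evaluation code often needs the judgments grouped by query and, for binary measures, a threshold that decides which levels count as relevant. The sketch below shows one way to do both. The local `TrecQrel` class, the sample records, the helper names, and the default threshold of 1 are hypothetical choices; in practice the records would come from `dataset.qrels_iter()`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class TrecQrel(NamedTuple):
    # Local stand-in that mirrors the fields documented above.
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def group_by_query(qrels: Iterable[TrecQrel]) -> Dict[str, Dict[str, int]]:
    """Map each query_id to a {doc_id: relevance} dictionary."""
    grouped: Dict[str, Dict[str, int]] = defaultdict(dict)
    for qrel in qrels:
        grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


def relevant_docs(grouped: Dict[str, Dict[str, int]], query_id: str, min_relevance: int = 1) -> List[str]:
    """List the doc_ids judged at or above min_relevance for one query."""
    judged = grouped.get(query_id, {})
    return sorted(doc_id for doc_id, rel in judged.items() if rel >= min_relevance)


# Hypothetical sample records (not taken from the dataset).
sample_qrels = [
    TrecQrel("q1", "doc-a", 2, "0"),
    TrecQrel("q1", "doc-b", 0, "0"),
    TrecQrel("q1", "doc-c", 1, "0"),
    TrecQrel("q2", "doc-a", 0, "0"),
]

grouped = group_by_query(sample_qrels)
print(relevant_docs(grouped, "q1"))
print(relevant_docs(grouped, "q1", min_relevance=2))
```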

CLI

ir_datasets export beir/dbpedia-entity/test qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity/test')
index_ref = pt.IndexRef.of('./indices/beir_dbpedia-entity') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.dbpedia-entity.test.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Hasibi2017DBpediaEntityVA,Thakur2021Beir}

Bibtex:

@article{Hasibi2017DBpediaEntityVA, title={DBpedia-Entity v2: A Test Collection for Entity Search}, author={Faegheh Hasibi and Fedor Nikolaev and Chenyan Xiong and K. Balog and S. E. Bratsberg and Alexander Kotov and J. Callan}, journal={Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval}, year={2017} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 4635922,
    "fields": {
      "doc_id": {
        "max_len": 200,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 400
  },
  "qrels": {
    "count": 43515,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 28229,
          "1": 8785,
          "2": 6501
        }
      }
    }
  }
}

`"beir/fever"`

A version of the FEVER dataset for fact verification. Includes queries from the /train, /dev, and /test subsets.

queries

123K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/fever")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/fever queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever')
index_ref = pt.IndexRef.of('./indices/beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.fever.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

5.4M docs

Language: en

Document type:

BeirTitleDoc: (namedtuple)

doc_id: str
text: str
title: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/fever")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>

You can find more details about the Python API here.

CLI

ir_datasets export beir/fever docs



[doc_id]    [text]    [title]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever')
# Index beir/fever
indexer = pt.IterDictIndexer('./indices/beir_fever', meta={"docno": 221})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.fever')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

Citation

ir_datasets.bib:

\cite{Thorne2018Fever,Thakur2021Beir}

Bibtex:

@inproceedings{Thorne2018Fever, title = "{FEVER}: a Large-scale Dataset for Fact Extraction and {VER}ification", author = "Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit", booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)", month = jun, year = "2018", address = "New Orleans, Louisiana", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/N18-1074", doi = "10.18653/v1/N18-1074", pages = "809--819" } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 5416568,
    "fields": {
      "doc_id": {
        "max_len": 221,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 123142
  }
}

`"beir/fever/dev"`

The official dev set.

queries

6.7K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/fever/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/fever/dev queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever/dev')
index_ref = pt.IndexRef.of('./indices/beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.fever.dev.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

5.4M docs

Inherits docs from beir/fever

Language: en

Document type:

BeirTitleDoc: (namedtuple)

doc_id: str
text: str
title: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/fever/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>

You can find more details about the Python API here.

CLI

ir_datasets export beir/fever/dev docs



[doc_id]    [text]    [title]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever/dev')
# Index beir/fever
indexer = pt.IterDictIndexer('./indices/beir_fever', meta={"docno": 221})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.fever.dev')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

8.1K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/fever/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/fever/dev qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/fever/dev')
index_ref = pt.IndexRef.of('./indices/beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.fever.dev.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Thorne2018Fever,Thakur2021Beir}

Bibtex:

@inproceedings{Thorne2018Fever, title = "{FEVER}: a Large-scale Dataset for Fact Extraction and {VER}ification", author = "Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit", booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)", month = jun, year = "2018", address = "New Orleans, Louisiana", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/N18-1074", doi = "10.18653/v1/N18-1074", pages = "809--819" } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 5416568,
    "fields": {
      "doc_id": {
        "max_len": 221,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 6666
  },
  "qrels": {
    "count": 8079,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 8079
        }
      }
    }
  }
}
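
The abbreviated counts shown under each heading (for example, "8.1K qrels" and "5.4M docs") appear to be rounded forms of the exact counts in the metadata. The sketch below shows one rounding rule that reproduces the figures on this page: one decimal place below ten units, none above. The `abbreviate` helper is hypothetical and is not part of ir_datasets.

```python
def abbreviate(count: int) -> str:
    """Shorten a count to a K/M form such as 8.1K or 5.4M."""
    for unit, suffix in ((1_000_000, "M"), (1_000, "K")):
        if count >= unit:
            scaled = count / unit
            digits = 1 if scaled < 10 else 0  # 8.1K, but 17K and 123K
            return f"{scaled:.{digits}f}{suffix}"
    return str(count)


# Exact counts for beir/fever/dev, copied from the metadata above.
for name, count in (("docs", 5416568), ("queries", 6666), ("qrels", 8079)):
    print(f"{abbreviate(count)} {name}")
```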

`"beir/fever/test"`

The official test set.

queries

6.7K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/fever/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/fever/test queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever/test')
index_ref = pt.IndexRef.of('./indices/beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.fever.test.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

5.4M docs

Inherits docs from beir/fever

Language: en

Document type:

BeirTitleDoc: (namedtuple)

doc_id: str
text: str
title: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/fever/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>

You can find more details about the Python API here.

CLI

ir_datasets export beir/fever/test docs



[doc_id]    [text]    [title]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever/test')
# Index beir/fever
indexer = pt.IterDictIndexer('./indices/beir_fever', meta={"docno": 221})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.
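The meta={"docno": 221} argument in the indexing example appears to be sized to the longest doc_id in this corpus, which the Metadata block for this dataset reports as a max_len of 221. The sketch below shows one way such a value could be derived for another collection. The Doc namedtuple and the sample IDs are hypothetical stand-ins for the records yielded by docs_iter(), and the function counts characters; whether the indexer's limit counts characters or bytes is not verified here.

```python
from collections import namedtuple
from typing import Iterable

# Hypothetical stand-in for the BeirTitleDoc records returned by docs_iter().
Doc = namedtuple("Doc", ["doc_id", "text", "title"])


def max_doc_id_len(docs: Iterable) -> int:
    """Return the length in characters of the longest doc_id (0 if there are no docs)."""
    return max((len(doc.doc_id) for doc in docs), default=0)


# Made-up sample documents, for illustration only.
sample_docs = [
    Doc("Example_page", "some text", "Example page"),
    Doc("A_much_longer_example_page_identifier", "more text", "A longer title"),
]

# Size the docno metadata field to fit the longest identifier seen.
meta = {"docno": max_doc_id_len(sample_docs)}
print(meta)
```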

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.fever.test')
for doc in dataset.iter_documents():
    print(doc)  # a single document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

7.9K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/fever/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/fever/test qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.
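The TSV export lists one judgment per line, with columns in the order shown above. The following is a minimal sketch of reading such output back into a nested dictionary keyed by query_id and then doc_id. The function name and the sample rows are made up for illustration, and the iteration column is read but not stored.

```python
import io
from collections import defaultdict


def read_qrels_tsv(handle):
    """Parse lines of `query_id<TAB>doc_id<TAB>relevance<TAB>iteration` into {query_id: {doc_id: relevance}}."""
    qrels = defaultdict(dict)
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        query_id, doc_id, relevance, _iteration = line.split("\t")
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


# Made-up rows in the exported column order, for illustration only.
sample = io.StringIO("q1\tdocA\t1\t0\nq1\tdocB\t1\t0\nq2\tdocC\t1\t0\n")
qrels = read_qrels_tsv(sample)
print(qrels)
```

In practice the handle would be a file holding the output of the export command rather than an in-memory string.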

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/fever/test')
index_ref = pt.IndexRef.of('./indices/beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.fever.test.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Thorne2018Fever,Thakur2021Beir}

Bibtex:

@inproceedings{Thorne2018Fever,
  title = "{FEVER}: a Large-scale Dataset for Fact Extraction and {VER}ification",
  author = "Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit",
  booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)",
  month = jun,
  year = "2018",
  address = "New Orleans, Louisiana",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/N18-1074",
  doi = "10.18653/v1/N18-1074",
  pages = "809--819"
}

@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

Metadata

{
  "docs": {
    "count": 5416568,
    "fields": {
      "doc_id": {
        "max_len": 221,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 6666
  },
  "qrels": {
    "count": 7937,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 7937
        }
      }
    }
  }
}
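As a quick reading aid, the metadata above can be used to derive simple summary figures such as the average number of judgments per query. The sketch below copies the values from the block above into a JSON string and computes that ratio; it also checks that the per-value relevance counts add up to the total qrels count.

```python
import json

# Values copied from the beir/fever/test Metadata block above.
metadata = json.loads("""
{
  "docs": {"count": 5416568},
  "queries": {"count": 6666},
  "qrels": {
    "count": 7937,
    "fields": {"relevance": {"counts_by_value": {"1": 7937}}}
  }
}
""")

qrels_count = metadata["qrels"]["count"]
query_count = metadata["queries"]["count"]
counts_by_value = metadata["qrels"]["fields"]["relevance"]["counts_by_value"]

# Average number of relevance judgments per query.
qrels_per_query = qrels_count / query_count
print(f"{qrels_per_query:.2f} qrels per query")

# The per-value counts should account for every judgment.
assert sum(counts_by_value.values()) == qrels_count
```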

`"beir/fever/train"`

The official train set.

queries

110K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/fever/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/fever/train queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever/train')
index_ref = pt.IndexRef.of('./indices/beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.fever.train.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

5.4M docs

Inherits docs from beir/fever

Language: en

Document type:

BeirTitleDoc: (namedtuple)

doc_id: str
text: str
title: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/fever/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>

You can find more details about the Python API here.

CLI

ir_datasets export beir/fever/train docs



[doc_id]    [text]    [title]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever/train')
# Index beir/fever
indexer = pt.IterDictIndexer('./indices/beir_fever', meta={"docno": 221})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.fever.train')
for doc in dataset.iter_documents():
    print(doc)  # a single document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

140K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/fever/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/fever/train qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/fever/train')
index_ref = pt.IndexRef.of('./indices/beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.fever.train.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Thorne2018Fever,Thakur2021Beir}

Bibtex:

@inproceedings{Thorne2018Fever,
  title = "{FEVER}: a Large-scale Dataset for Fact Extraction and {VER}ification",
  author = "Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit",
  booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)",
  month = jun,
  year = "2018",
  address = "New Orleans, Louisiana",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/N18-1074",
  doi = "10.18653/v1/N18-1074",
  pages = "809--819"
}

@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

Metadata

{
  "docs": {
    "count": 5416568,
    "fields": {
      "doc_id": {
        "max_len": 221,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 109810
  },
  "qrels": {
    "count": 140085,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 140085
        }
      }
    }
  }
}

`"beir/fiqa"`

A version of the FIQA-2018 dataset (financial opinion question answering). Queries include those in the /train, /dev, and /test subsets.

queries

6.6K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/fiqa")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/fiqa queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa')
index_ref = pt.IndexRef.of('./indices/beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.fiqa.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

58K docs

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/fiqa")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/fiqa docs



[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa')
# Index beir/fiqa
indexer = pt.IterDictIndexer('./indices/beir_fiqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.fiqa')
for doc in dataset.iter_documents():
    print(doc)  # a single document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

Citation

ir_datasets.bib:

\cite{Maia2018Fiqa,Thakur2021Beir}

Bibtex:

@article{Maia2018Fiqa,
  title={WWW'18 Open Challenge: Financial Opinion Mining and Question Answering},
  author={Macedo Maia and S. Handschuh and A. Freitas and Brian Davis and R. McDermott and M. Zarrouk and A. Balahur},
  journal={Companion Proceedings of the The Web Conference 2018},
  year={2018}
}

@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

Metadata

{
  "docs": {
    "count": 57638,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 6648
  }
}

`"beir/fiqa/dev"`

Random sample of 500 queries from the official dataset.

queries

500 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/fiqa/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/fiqa/dev queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/dev')
index_ref = pt.IndexRef.of('./indices/beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.fiqa.dev.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

58K docs

Inherits docs from beir/fiqa

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/fiqa/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/fiqa/dev docs



[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/dev')
# Index beir/fiqa
indexer = pt.IterDictIndexer('./indices/beir_fiqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.fiqa.dev')
for doc in dataset.iter_documents():
    print(doc)  # a single document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

1.2K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/fiqa/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/fiqa/dev qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/dev')
index_ref = pt.IndexRef.of('./indices/beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.fiqa.dev.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Maia2018Fiqa,Thakur2021Beir}

Bibtex:

@article{Maia2018Fiqa,
  title={WWW'18 Open Challenge: Financial Opinion Mining and Question Answering},
  author={Macedo Maia and S. Handschuh and A. Freitas and Brian Davis and R. McDermott and M. Zarrouk and A. Balahur},
  journal={Companion Proceedings of the The Web Conference 2018},
  year={2018}
}

@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

Metadata

{
  "docs": {
    "count": 57638,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 500
  },
  "qrels": {
    "count": 1238,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 1238
        }
      }
    }
  }
}

`"beir/fiqa/test"`

Random sample of 648 queries from the official dataset.

queries

648 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/fiqa/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/fiqa/test queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/test')
index_ref = pt.IndexRef.of('./indices/beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.fiqa.test.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

58K docs

Inherits docs from beir/fiqa

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/fiqa/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/fiqa/test docs



[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/test')
# Index beir/fiqa
indexer = pt.IterDictIndexer('./indices/beir_fiqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.fiqa.test')
for doc in dataset.iter_documents():
    print(doc)  # a single document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

1.7K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/fiqa/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/fiqa/test qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/test')
index_ref = pt.IndexRef.of('./indices/beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.fiqa.test.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Maia2018Fiqa,Thakur2021Beir}

Bibtex:

@article{Maia2018Fiqa,
  title={WWW'18 Open Challenge: Financial Opinion Mining and Question Answering},
  author={Macedo Maia and S. Handschuh and A. Freitas and Brian Davis and R. McDermott and M. Zarrouk and A. Balahur},
  journal={Companion Proceedings of the The Web Conference 2018},
  year={2018}
}

@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

Metadata

{
  "docs": {
    "count": 57638,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 648
  },
  "qrels": {
    "count": 1706,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 1706
        }
      }
    }
  }
}

`"beir/fiqa/train"`

The official dataset, excluding the 1,148 queries sampled for /dev and /test.

queries

5.5K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/fiqa/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/fiqa/train queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/train')
index_ref = pt.IndexRef.of('./indices/beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.fiqa.train.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

58K docs

Inherits docs from beir/fiqa

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/fiqa/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/fiqa/train docs



[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/train')
# Index beir/fiqa
indexer = pt.IterDictIndexer('./indices/beir_fiqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.fiqa.train')
for doc in dataset.iter_documents():
    print(doc)  # a single document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

14K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/fiqa/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/fiqa/train qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/train')
index_ref = pt.IndexRef.of('./indices/beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.fiqa.train.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Maia2018Fiqa,Thakur2021Beir}

Bibtex:

@article{Maia2018Fiqa,
  title={WWW'18 Open Challenge: Financial Opinion Mining and Question Answering},
  author={Macedo Maia and S. Handschuh and A. Freitas and Brian Davis and R. McDermott and M. Zarrouk and A. Balahur},
  journal={Companion Proceedings of the The Web Conference 2018},
  year={2018}
}

@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

Metadata

{
  "docs": {
    "count": 57638,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 5500
  },
  "qrels": {
    "count": 14166,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 14166
        }
      }
    }
  }
}
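The query counts reported for the FiQA subsets on this page are consistent with one another. The sketch below, which uses only the counts from the Metadata blocks and assumes the three subsets do not share queries, checks that /dev and /test together account for the 1,148 held-out queries and that all three subsets add up to the 6,648 queries of beir/fiqa.

```python
# Query counts copied from the Metadata blocks on this page.
split_query_counts = {"dev": 500, "test": 648, "train": 5500}
total_queries = 6648  # beir/fiqa

# Queries sampled out of the official dataset for /dev and /test.
held_out = split_query_counts["dev"] + split_query_counts["test"]
print(held_out)

# Assuming the subsets are disjoint, their sizes should sum to the parent count.
print(sum(split_query_counts.values()) == total_queries)
```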

`"beir/hotpotqa"`

A version of the Hotpot QA dataset for multi-hop question answering. Queries include all those in /train, /dev, and /test.

queries

98K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/hotpotqa queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa')
index_ref = pt.IndexRef.of('./indices/beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.hotpotqa.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

5.2M docs

Language: en

Document type:

BeirTitleUrlDoc: (namedtuple)

doc_id: str
text: str
title: str
url: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url>

You can find more details about the Python API here.

CLI

ir_datasets export beir/hotpotqa docs



[doc_id]    [text]    [title]    [url]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa')
# Index beir/hotpotqa
indexer = pt.IterDictIndexer('./indices/beir_hotpotqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.hotpotqa')
for doc in dataset.iter_documents():
    print(doc)  # a single document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

Citation

ir_datasets.bib:

\cite{Yang2018Hotpotqa,Thakur2021Beir}

Bibtex:

@inproceedings{Yang2018Hotpotqa,
  title = "{H}otpot{QA}: A Dataset for Diverse, Explainable Multi-hop Question Answering",
  author = "Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William and Salakhutdinov, Ruslan and Manning, Christopher D.",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
  month = oct # "-" # nov,
  year = "2018",
  address = "Brussels, Belgium",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/D18-1259",
  doi = "10.18653/v1/D18-1259",
  pages = "2369--2380"
}

@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

Metadata

{
  "docs": {
    "count": 5233329,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 97852
  }
}
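The abbreviated counts shown in the section headings (for example "98K queries" and "5.2M docs") correspond to the exact counts in the Metadata blocks. The function below is a sketch of one rounding rule that reproduces the abbreviations seen on this page; it is an assumption inferred from those values, not the code that generates the page, and the name abbreviate is hypothetical.

```python
def abbreviate(count: int) -> str:
    """Abbreviate a count to roughly two significant figures with a K or M suffix.

    Boundary cases (e.g. 9,960 or 999,999) are not special-cased in this sketch.
    """
    if count < 1_000:
        return str(count)
    unit, suffix = (1_000_000, "M") if count >= 1_000_000 else (1_000, "K")
    value = count / unit
    # One decimal place below 10 units, none at or above.
    digits = 1 if value < 10 else 0
    return f"{value:.{digits}f}{suffix}"


# Exact counts taken from Metadata blocks on this page.
for exact in (97852, 5233329, 648, 6648, 140085):
    print(exact, "->", abbreviate(exact))
```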

`"beir/hotpotqa/dev"`

A random selection of 5,447 queries from /train.

queries

5.4K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/hotpotqa/dev queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/dev')
index_ref = pt.IndexRef.of('./indices/beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.hotpotqa.dev.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

5.2M docs

Inherits docs from beir/hotpotqa

Language: en

Document type:

BeirTitleUrlDoc: (namedtuple)

doc_id: str
text: str
title: str
url: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url>

You can find more details about the Python API here.

CLI

ir_datasets export beir/hotpotqa/dev docs



[doc_id]    [text]    [title]    [url]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/dev')
# Index beir/hotpotqa
indexer = pt.IterDictIndexer('./indices/beir_hotpotqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.hotpotqa.dev')
for doc in dataset.iter_documents():
    print(doc)  # a single document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

11K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
1	-	10,894	100%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
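Evaluation code often needs the judgments grouped by query rather than as a flat stream. The sketch below builds a nested `query_id -> doc_id -> relevance` dictionary from any iterable of records with those attributes. The `Qrel` namedtuple and its values are stand-ins for illustration; with the real dataset you would pass `dataset.qrels_iter()`.

```python
from collections import defaultdict, namedtuple

# Stand-in with the same field names as TrecQrel; the values are made up.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def qrels_to_dict(qrels_iter):
    """Group judgments as {query_id: {doc_id: relevance}}."""
    by_query = defaultdict(dict)
    for qrel in qrels_iter:
        by_query[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(by_query)


fake_qrels = [
    Qrel("q1", "d1", 1, "0"),
    Qrel("q1", "d7", 1, "0"),
    Qrel("q2", "d3", 1, "0"),
]
lookup = qrels_to_dict(fake_qrels)
```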

CLI

ir_datasets export beir/hotpotqa/dev qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/dev')
index_ref = pt.IndexRef.of('./indices/beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.
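To make the `nDCG@20` measure in the experiment above more concrete, here is a plain-Python sketch of nDCG at a cutoff. It assumes linear gain and a log2 rank discount; whether this matches the PyTerrier measure in every detail (for example tie handling) is an assumption, not something this page states. The document IDs and judgments are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_doc_ids, judged, k=20):
    """nDCG@k for one query; unjudged documents count as gain 0."""
    gains = [judged.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(judged.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments and ranking for a single query.
judged = {"d1": 1, "d7": 1}
score = ndcg_at_k(["d7", "d2", "d1"], judged)
perfect = ndcg_at_k(["d1", "d7", "d2"], judged)
```

A ranking that places every relevant document first scores 1.0; the example ranking scores lower because one relevant document only appears at rank 3.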

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.hotpotqa.dev.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for a single topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Yang2018Hotpotqa,Thakur2021Beir}

Bibtex:

@inproceedings{Yang2018Hotpotqa,
  title = "{H}otpot{QA}: A Dataset for Diverse, Explainable Multi-hop Question Answering",
  author = "Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William and Salakhutdinov, Ruslan and Manning, Christopher D.",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
  month = oct # "-" # nov,
  year = "2018",
  address = "Brussels, Belgium",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/D18-1259",
  doi = "10.18653/v1/D18-1259",
  pages = "2369--2380"
}
@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

Metadata

{
  "docs": {
    "count": 5233329,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 5447
  },
  "qrels": {
    "count": 10894,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 10894
        }
      }
    }
  }
}
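The metadata block is plain JSON, so simple sanity checks can be derived from it. The sketch below parses a trimmed copy of the values above and computes the average number of judgments per query. The result of 2.0 is consistent with two judged documents per question, although the counts alone do not show that every query has exactly two.

```python
import json

# Trimmed copy of the metadata shown above.
metadata = json.loads("""
{
  "docs": {"count": 5233329},
  "queries": {"count": 5447},
  "qrels": {
    "count": 10894,
    "fields": {"relevance": {"counts_by_value": {"1": 10894}}}
  }
}
""")

qrels_per_query = metadata["qrels"]["count"] / metadata["queries"]["count"]
levels = sorted(metadata["qrels"]["fields"]["relevance"]["counts_by_value"])
```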

`"beir/hotpotqa/test"`

Official dev set from HotpotQA, here used as a test set.

queries

7.4K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/hotpotqa/test queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/test')
index_ref = pt.IndexRef.of('./indices/beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.hotpotqa.test.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

5.2M docs

Inherits docs from beir/hotpotqa

Language: en

Document type:

BeirTitleUrlDoc: (namedtuple)

doc_id: str
text: str
title: str
url: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url>

You can find more details about the Python API here.

CLI

ir_datasets export beir/hotpotqa/test docs



[doc_id]    [text]    [title]    [url]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/test')
# Index beir/hotpotqa
indexer = pt.IterDictIndexer('./indices/beir_hotpotqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.hotpotqa.test')
for doc in dataset.iter_documents():
    print(doc)  # a single document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

15K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
1	-	14,810	100%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/hotpotqa/test qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/test')
index_ref = pt.IndexRef.of('./indices/beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.hotpotqa.test.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for a single topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Yang2018Hotpotqa,Thakur2021Beir}

Bibtex:

@inproceedings{Yang2018Hotpotqa,
  title = "{H}otpot{QA}: A Dataset for Diverse, Explainable Multi-hop Question Answering",
  author = "Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William and Salakhutdinov, Ruslan and Manning, Christopher D.",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
  month = oct # "-" # nov,
  year = "2018",
  address = "Brussels, Belgium",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/D18-1259",
  doi = "10.18653/v1/D18-1259",
  pages = "2369--2380"
}
@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

Metadata

{
  "docs": {
    "count": 5233329,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 7405
  },
  "qrels": {
    "count": 14810,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 14810
        }
      }
    }
  }
}

`"beir/hotpotqa/train"`

The official HotpotQA train set, excluding the 5,447 randomly selected queries that are used for /dev.
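The split described here is a random hold-out: some queries are drawn from the official train set for /dev and the rest remain in /train. The sketch below illustrates the idea using the query counts from the Metadata blocks (85,000 for /train and 5,447 for /dev, so 90,447 in total). The query IDs, the seed, and the `hold_out` helper are hypothetical; the procedure actually used to pick the /dev queries is not described on this page.

```python
import random


def hold_out(query_ids, n_dev, seed=0):
    """Randomly move n_dev query IDs into a dev split; keep the rest as train."""
    rng = random.Random(seed)
    dev = set(rng.sample(sorted(query_ids), n_dev))
    train = [q for q in query_ids if q not in dev]
    return train, sorted(dev)


# Hypothetical IDs; only the counts come from the metadata.
all_ids = [f"q{i}" for i in range(85000 + 5447)]
train_ids, dev_ids = hold_out(all_ids, n_dev=5447)
```

The two splits are disjoint by construction, which is the property that matters when /dev is used for model selection.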

queries

85K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/hotpotqa/train queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/train')
index_ref = pt.IndexRef.of('./indices/beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.hotpotqa.train.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

5.2M docs

Inherits docs from beir/hotpotqa

Language: en

Document type:

BeirTitleUrlDoc: (namedtuple)

doc_id: str
text: str
title: str
url: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url>

You can find more details about the Python API here.

CLI

ir_datasets export beir/hotpotqa/train docs



[doc_id]    [text]    [title]    [url]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/train')
# Index beir/hotpotqa
indexer = pt.IterDictIndexer('./indices/beir_hotpotqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.hotpotqa.train')
for doc in dataset.iter_documents():
    print(doc)  # a single document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

170K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
1	-	170,000	100%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/hotpotqa/train qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/train')
index_ref = pt.IndexRef.of('./indices/beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.hotpotqa.train.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for a single topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Yang2018Hotpotqa,Thakur2021Beir}

Bibtex:

@inproceedings{Yang2018Hotpotqa,
  title = "{H}otpot{QA}: A Dataset for Diverse, Explainable Multi-hop Question Answering",
  author = "Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William and Salakhutdinov, Ruslan and Manning, Christopher D.",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
  month = oct # "-" # nov,
  year = "2018",
  address = "Brussels, Belgium",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/D18-1259",
  doi = "10.18653/v1/D18-1259",
  pages = "2369--2380"
}
@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

Metadata

{
  "docs": {
    "count": 5233329,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 85000
  },
  "qrels": {
    "count": 170000,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 170000
        }
      }
    }
  }
}

`"beir/msmarco"`

A version of the MS MARCO passage ranking dataset. Includes queries from the /train, /dev, and /test sub-datasets.

Note that this version differs from msmarco-passage, in that it does not correct the encoding problems in the source documents.
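The note above does not say what the encoding problems look like. A common kind is mojibake, where UTF-8 bytes were decoded as Latin-1 at some point. The sketch below assumes that kind of error purely for illustration; the `garble` and `repair` helpers are hypothetical and are not the correction applied by msmarco-passage.

```python
def garble(text):
    """Simulate mojibake: UTF-8 bytes misread as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(text):
    """Undo the misreading when possible; otherwise return the text unchanged."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


clean = "café"
broken = garble(clean)   # the accented letter becomes two characters
fixed = repair(broken)
```

Text that was never garbled passes through `repair` unchanged, because the round trip either is a no-op (pure ASCII) or raises and falls back to the input.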

queries

510K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/msmarco")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/msmarco queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco')
index_ref = pt.IndexRef.of('./indices/beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.msmarco.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

8.8M docs

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/msmarco")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/msmarco docs



[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco')
# Index beir/msmarco
indexer = pt.IterDictIndexer('./indices/beir_msmarco')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.msmarco')
for doc in dataset.iter_documents():
    print(doc)  # a single document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

Citation

ir_datasets.bib:

\cite{Bajaj2016Msmarco,Thakur2021Beir}

Bibtex:

@inproceedings{Bajaj2016Msmarco,
  title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset},
  author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang},
  booktitle={InCoCo@NIPS},
  year={2016}
}
@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

Metadata

{
  "docs": {
    "count": 8841823,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 509962
  }
}
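Since this dataset combines the queries of /train, /dev, and /test, its query count should equal the sum of the three subset counts. The short check below uses the `queries.count` values from the Metadata blocks of those subsets; it verifies only the totals, not that the query IDs themselves are disjoint.

```python
# Query counts taken from the Metadata blocks of the three subsets.
subset_query_counts = {"train": 502939, "dev": 6980, "test": 43}

# Query count reported for beir/msmarco itself.
parent_query_count = 509962

total = sum(subset_query_counts.values())
counts_match = total == parent_query_count
```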

`"beir/msmarco/dev"`

A version of the MS MARCO passage ranking dev set.

See also: msmarco-passage/dev
Dataset Paper

queries

7.0K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/msmarco/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/msmarco/dev queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/dev')
index_ref = pt.IndexRef.of('./indices/beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.msmarco.dev.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

8.8M docs

Inherits docs from beir/msmarco

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/msmarco/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/msmarco/dev docs



[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/dev')
# Index beir/msmarco
indexer = pt.IterDictIndexer('./indices/beir_msmarco')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.msmarco.dev')
for doc in dataset.iter_documents():
    print(doc)  # a single document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

7.4K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
1	-	7,437	100%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/msmarco/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/msmarco/dev qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/dev')
index_ref = pt.IndexRef.of('./indices/beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.
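The `MAP` measure in the experiment above averages, over queries, the precision observed at each rank where a relevant document is retrieved. The following is a plain-Python sketch of that definition for binary judgments. It is an illustration, not the implementation PyTerrier uses, and the run and judgments are hypothetical.

```python
def average_precision(ranked_doc_ids, relevant):
    """Average precision for one query with a set of relevant doc IDs."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(run, qrels):
    """Mean of per-query average precision over all judged queries."""
    scores = [average_precision(run.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical run and judgments.
run = {"q1": ["d3", "d1", "d9"], "q2": ["d5"]}
qrels = {"q1": {"d1"}, "q2": {"d5", "d6"}}
map_score = mean_average_precision(run, qrels)
```

Relevant documents that are never retrieved still count in the denominator, which is why q2 scores 0.5 even though its only retrieved document is relevant.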

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.msmarco.dev.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for a single topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Bajaj2016Msmarco,Thakur2021Beir}

Bibtex:

@inproceedings{Bajaj2016Msmarco,
  title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset},
  author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang},
  booktitle={InCoCo@NIPS},
  year={2016}
}
@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

Metadata

{
  "docs": {
    "count": 8841823,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 6980
  },
  "qrels": {
    "count": 7437,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 7437
        }
      }
    }
  }
}

`"beir/msmarco/test"`

A version of the TREC Deep Learning 2019 set.

See also: msmarco-passage/trec-dl-2019
Shared Task Paper

queries

43 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/msmarco/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/msmarco/test queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/test')
index_ref = pt.IndexRef.of('./indices/beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.msmarco.test.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

8.8M docs

Inherits docs from beir/msmarco

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/msmarco/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/msmarco/test docs



[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/test')
# Index beir/msmarco
indexer = pt.IterDictIndexer('./indices/beir_msmarco')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.msmarco.test')
for doc in dataset.iter_documents():
    print(doc)  # a single document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

9.3K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	-	5,158	55.7%
1	-	1,601	17.3%
2	-	1,804	19.5%
3	-	697	7.5%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/msmarco/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/msmarco/test qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/test')
index_ref = pt.IndexRef.of('./indices/beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.msmarco.test.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for a single topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Craswell2019TrecDl,Bajaj2016Msmarco,Thakur2021Beir}

Bibtex:

@inproceedings{Craswell2019TrecDl,
  title={Overview of the TREC 2019 deep learning track},
  author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos and Ellen Voorhees},
  booktitle={TREC 2019},
  year={2019}
}
@inproceedings{Bajaj2016Msmarco,
  title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset},
  author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang},
  booktitle={InCoCo@NIPS},
  year={2016}
}
@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

Metadata

{
  "docs": {
    "count": 8841823,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 43
  },
  "qrels": {
    "count": 9260,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 5158,
          "1": 1601,
          "2": 1804,
          "3": 697
        }
      }
    }
  }
}
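Unlike the other subsets in this section, /test has graded judgments with four relevance levels. The sketch below derives the share of each level from `counts_by_value` and shows how the number of "relevant" judgments depends on the threshold chosen when a binary measure such as MAP is used. Which threshold is appropriate is a choice for the experimenter; the `min_rel` values shown are examples, not something prescribed on this page.

```python
# Relevance counts copied from the metadata above.
counts_by_value = {"0": 5158, "1": 1601, "2": 1804, "3": 697}

total = sum(counts_by_value.values())
share = {level: round(100 * n / total, 1) for level, n in counts_by_value.items()}


def count_relevant(counts, min_rel):
    """Number of judgments whose relevance level is at least min_rel."""
    return sum(n for level, n in counts.items() if int(level) >= min_rel)


relevant_at_1 = count_relevant(counts_by_value, min_rel=1)
relevant_at_2 = count_relevant(counts_by_value, min_rel=2)
```

Raising the threshold from 1 to 2 removes the 1,601 judgments at level 1 from the relevant set, which can change binary measures noticeably with only 43 queries.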

`"beir/msmarco/train"`

A version of the MS MARCO passage ranking train set.

See also: msmarco-passage/train

queries

503K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/msmarco/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/msmarco/train queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/train')
index_ref = pt.IndexRef.of('./indices/beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.msmarco.train.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

8.8M docs

Inherits docs from beir/msmarco

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/msmarco/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/msmarco/train docs



[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/train')
# Index beir/msmarco
indexer = pt.IterDictIndexer('./indices/beir_msmarco')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.msmarco.train')
for doc in dataset.iter_documents():
    print(doc)  # a single document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

533K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
1	-	532,751	100%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/msmarco/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/msmarco/train qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/train')
index_ref = pt.IndexRef.of('./indices/beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.msmarco.train.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for a single topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Bajaj2016Msmarco,Thakur2021Beir}

Bibtex:

@inproceedings{Bajaj2016Msmarco,
  title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset},
  author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang},
  booktitle={InCoCo@NIPS},
  year={2016}
}
@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

Metadata

{
  "docs": {
    "count": 8841823,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 502939
  },
  "qrels": {
    "count": 532751,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 532751
        }
      }
    }
  }
}

`"beir/nfcorpus"`

A version of the NF Corpus (Nutrition Facts). Queries use the "title" variant of the query, which here is often a natural language question. Queries include all those from /train, /dev, and /test.

Data pre-processing may differ from what is done in nfcorpus.

Dataset website

Dataset paper

See also: nfcorpus

queries

3.2K queries

Language: en

Query type:

BeirUrlQuery: (namedtuple)

query_id: str
text: str
url: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, url>

You can find more details about the Python API here.

CLI

ir_datasets export beir/nfcorpus queries



[query_id]    [text]    [url]
...

You can find more details about the CLI here.
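The exported lines can be read back into tuples with the same three fields. This is a minimal sketch, assuming the export is tab-separated with the fields in the order shown above and that no field contains a tab or newline. The `Query` tuple and the sample lines are hypothetical stand-ins, not part of ir_datasets.

```python
import csv
import io
from collections import namedtuple

# Local stand-in mirroring the BeirUrlQuery fields listed above.
Query = namedtuple("Query", ["query_id", "text", "url"])

# Hypothetical export output: one query per line, tab-separated.
exported = (
    "q1\tan example question?\thttp://example.com/q1\n"
    "q2\tanother example question?\thttp://example.com/q2\n"
)

def read_queries(stream):
    """Yield one Query per tab-separated line."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield Query(*row)

queries = list(read_queries(io.StringIO(exported)))
print(queries[0].query_id, queries[0].url)
```

In practice, `read_queries` would receive an open file containing the saved export rather than an in-memory string.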

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus')
index_ref = pt.IndexRef.of('./indices/beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.nfcorpus.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

3.6K docs

Language: en

Document type:

BeirTitleUrlDoc: (namedtuple)

doc_id: str
text: str
title: str
url: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url>

You can find more details about the Python API here.

CLI

ir_datasets export beir/nfcorpus docs



[doc_id]    [text]    [title]    [url]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus')
# Index beir/nfcorpus
indexer = pt.IterDictIndexer('./indices/beir_nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.nfcorpus')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

Citation

ir_datasets.bib:

\cite{Boteva2016Nfcorpus,Thakur2021Beir}

Bibtex:

@inproceedings{Boteva2016Nfcorpus, title="A Full-Text Learning to Rank Dataset for Medical Information Retrieval", author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler", booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})", location = "Padova, Italy", publisher = "Springer", year = 2016 } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 3633,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": "MED-"
      }
    }
  },
  "queries": {
    "count": 3237
  }
}
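The query count above can be checked against the statement that the queries combine /train, /dev, and /test. The sketch below sums the per-split query counts reported in the Metadata blocks of beir/nfcorpus/train, beir/nfcorpus/dev, and beir/nfcorpus/test. The dictionary is copied by hand from this page and is an assumption of the example, not something the library returns.

```python
# Query counts copied by hand from the Metadata blocks on this page.
split_query_counts = {
    "beir/nfcorpus/train": 2590,
    "beir/nfcorpus/dev": 324,
    "beir/nfcorpus/test": 323,
}
combined_query_count = 3237  # "queries" count of beir/nfcorpus

total = sum(split_query_counts.values())
matches = total == combined_query_count
print(total, matches)
```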

`"beir/nfcorpus/dev"`

Combined dev set of NFCorpus.

See also: nfcorpus/dev

queries

324 queries

Language: en

Query type:

BeirUrlQuery: (namedtuple)

query_id: str
text: str
url: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, url>

You can find more details about the Python API here.

CLI

ir_datasets export beir/nfcorpus/dev queries



[query_id]    [text]    [url]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/dev')
index_ref = pt.IndexRef.of('./indices/beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.nfcorpus.dev.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

3.6K docs

Inherits docs from beir/nfcorpus

Language: en

Document type:

BeirTitleUrlDoc: (namedtuple)

doc_id: str
text: str
title: str
url: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url>

You can find more details about the Python API here.

CLI

ir_datasets export beir/nfcorpus/dev docs



[doc_id]    [text]    [title]    [url]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/dev')
# Index beir/nfcorpus
indexer = pt.IterDictIndexer('./indices/beir_nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.nfcorpus.dev')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

11K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
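A common next step after iterating is to group the judgments by query. This is a minimal sketch using a local stand-in tuple with the same four fields as TrecQrel, so that it runs without downloading the dataset. The `Qrel` name, the helper `group_by_query`, and the sample records are hypothetical.

```python
from collections import defaultdict, namedtuple

# Local stand-in with the same fields as TrecQrel.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])

# Hypothetical records; real ones would come from dataset.qrels_iter().
sample_qrels = [
    Qrel("q1", "MED-1", 2, "0"),
    Qrel("q1", "MED-2", 1, "0"),
    Qrel("q2", "MED-3", 1, "0"),
]

def group_by_query(qrels):
    """Map each query_id to a {doc_id: relevance} dictionary."""
    grouped = defaultdict(dict)
    for qrel in qrels:
        grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)

judgments = group_by_query(sample_qrels)
print(judgments["q1"])
```

Because the helper only reads the `query_id`, `doc_id`, and `relevance` attributes, it should also accept the tuples yielded by `dataset.qrels_iter()`.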

CLI

ir_datasets export beir/nfcorpus/dev qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/dev')
index_ref = pt.IndexRef.of('./indices/beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.nfcorpus.dev.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Boteva2016Nfcorpus,Thakur2021Beir}

Bibtex:

@inproceedings{Boteva2016Nfcorpus, title="A Full-Text Learning to Rank Dataset for Medical Information Retrieval", author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler", booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})", location = "Padova, Italy", publisher = "Springer", year = 2016 } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 3633,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": "MED-"
      }
    }
  },
  "queries": {
    "count": 324
  },
  "qrels": {
    "count": 11385,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 521,
          "1": 10864
        }
      }
    }
  }
}
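The share of each relevance level follows directly from `counts_by_value`. The sketch below computes those percentages from the numbers in the Metadata block above. The dictionary is copied by hand, and the one-decimal rounding is a presentation choice of this example.

```python
# Copied from the "qrels" entry of the Metadata block above.
counts_by_value = {"2": 521, "1": 10864}

total = sum(counts_by_value.values())
shares = {
    level: round(100 * count / total, 1)
    for level, count in counts_by_value.items()
}
for level in sorted(shares, reverse=True):
    print(f"rel={level}: {counts_by_value[level]} qrels ({shares[level]}%)")
```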

`"beir/nfcorpus/test"`

Combined test set of NFCorpus.

See also: nfcorpus/test

queries

323 queries

Language: en

Query type:

BeirUrlQuery: (namedtuple)

query_id: str
text: str
url: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, url>

You can find more details about the Python API here.

CLI

ir_datasets export beir/nfcorpus/test queries



[query_id]    [text]    [url]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/test')
index_ref = pt.IndexRef.of('./indices/beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.nfcorpus.test.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

3.6K docs

Inherits docs from beir/nfcorpus

Language: en

Document type:

BeirTitleUrlDoc: (namedtuple)

doc_id: str
text: str
title: str
url: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url>

You can find more details about the Python API here.

CLI

ir_datasets export beir/nfcorpus/test docs



[doc_id]    [text]    [title]    [url]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/test')
# Index beir/nfcorpus
indexer = pt.IterDictIndexer('./indices/beir_nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.nfcorpus.test')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

12K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/nfcorpus/test qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/test')
index_ref = pt.IndexRef.of('./indices/beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.nfcorpus.test.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Boteva2016Nfcorpus,Thakur2021Beir}

Bibtex:

@inproceedings{Boteva2016Nfcorpus, title="A Full-Text Learning to Rank Dataset for Medical Information Retrieval", author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler", booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})", location = "Padova, Italy", publisher = "Springer", year = 2016 } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 3633,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": "MED-"
      }
    }
  },
  "queries": {
    "count": 323
  },
  "qrels": {
    "count": 12334,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 576,
          "1": 11758
        }
      }
    }
  }
}

`"beir/nfcorpus/train"`

Combined train set of NFCorpus.

See also: nfcorpus/train

queries

2.6K queries

Language: en

Query type:

BeirUrlQuery: (namedtuple)

query_id: str
text: str
url: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, url>

You can find more details about the Python API here.

CLI

ir_datasets export beir/nfcorpus/train queries



[query_id]    [text]    [url]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/train')
index_ref = pt.IndexRef.of('./indices/beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.nfcorpus.train.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

3.6K docs

Inherits docs from beir/nfcorpus

Language: en

Document type:

BeirTitleUrlDoc: (namedtuple)

doc_id: str
text: str
title: str
url: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url>

You can find more details about the Python API here.

CLI

ir_datasets export beir/nfcorpus/train docs



[doc_id]    [text]    [title]    [url]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/train')
# Index beir/nfcorpus
indexer = pt.IterDictIndexer('./indices/beir_nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.nfcorpus.train')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

111K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/nfcorpus/train qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/train')
index_ref = pt.IndexRef.of('./indices/beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.nfcorpus.train.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Boteva2016Nfcorpus,Thakur2021Beir}

Bibtex:

@inproceedings{Boteva2016Nfcorpus, title="A Full-Text Learning to Rank Dataset for Medical Information Retrieval", author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler", booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})", location = "Padova, Italy", publisher = "Springer", year = 2016 } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 3633,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": "MED-"
      }
    }
  },
  "queries": {
    "count": 2590
  },
  "qrels": {
    "count": 110575,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 110575
        }
      }
    }
  }
}
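The three NFCorpus splits differ in how densely they are judged. The sketch below divides the qrels count by the query count for each split, using the numbers from the Metadata blocks of beir/nfcorpus/dev, beir/nfcorpus/test, and beir/nfcorpus/train. It is an average over the listed queries and assumes nothing about whether every query has at least one judgment.

```python
# (queries, qrels) counts copied by hand from the three Metadata blocks.
splits = {
    "dev": (324, 11385),
    "test": (323, 12334),
    "train": (2590, 110575),
}

qrels_per_query = {
    name: round(qrels / queries, 1)
    for name, (queries, qrels) in splits.items()
}
print(qrels_per_query)
```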

`"beir/nq"`

A version of the Natural Questions dev dataset.

Data pre-processing differs from what is done in both natural-questions and dpr-w100/natural-questions, especially in the document collection and in the filtering applied to the queries. See the Beir paper for details.

Dataset website
Dataset paper
See also: natural-questions, dpr-w100/natural-questions

queries

3.5K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/nq")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/nq queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nq')
index_ref = pt.IndexRef.of('./indices/beir_nq') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.nq.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

2.7M docs

Language: en

Document type:

BeirTitleDoc: (namedtuple)

doc_id: str
text: str
title: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/nq")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>

You can find more details about the Python API here.

CLI

ir_datasets export beir/nq docs



[doc_id]    [text]    [title]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nq')
# Index beir/nq
indexer = pt.IterDictIndexer('./indices/beir_nq')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.nq')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

4.2K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/nq")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/nq qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/nq')
index_ref = pt.IndexRef.of('./indices/beir_nq') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.nq.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Kwiatkowski2019Nq,Thakur2021Beir}

Bibtex:

@article{Kwiatkowski2019Nq, title = {Natural Questions: a Benchmark for Question Answering Research}, author = {Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Matthew Kelcey and Jacob Devlin and Kenton Lee and Kristina N. Toutanova and Llion Jones and Ming-Wei Chang and Andrew Dai and Jakob Uszkoreit and Quoc Le and Slav Petrov}, year = {2019}, journal = {TACL} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 2681468,
    "fields": {
      "doc_id": {
        "max_len": 10,
        "common_prefix": "doc"
      }
    }
  },
  "queries": {
    "count": 3452
  },
  "qrels": {
    "count": 4201,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 4201
        }
      }
    }
  }
}
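The `doc_id` fields in the metadata can serve as a cheap sanity check on document IDs before a lookup. This is a minimal sketch, assuming `max_len` is the length of the longest ID in characters and `common_prefix` is shared by every ID. The helper `looks_like_doc_id` and the candidate IDs are hypothetical.

```python
# Copied from the "docs" entry of the Metadata block above.
doc_id_meta = {"max_len": 10, "common_prefix": "doc"}

def looks_like_doc_id(doc_id, meta=doc_id_meta):
    """Check prefix and length only; this is not a collection lookup."""
    has_prefix = doc_id.startswith(meta["common_prefix"])
    fits_length = len(doc_id) <= meta["max_len"]
    return has_prefix and fits_length

# Hypothetical IDs, for illustration only.
candidates = ["doc0", "doc1234567", "MED-10", "doc12345678"]
plausible = [doc_id for doc_id in candidates if looks_like_doc_id(doc_id)]
print(plausible)
```

An ID that passes this check may still be absent from the collection; the check only rules out IDs that cannot belong to it.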

`"beir/quora"`

A version of the Quora duplicate question detection dataset (QQP). Includes queries from the /dev and /test sets.

Dataset website

queries

15K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/quora")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/quora queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/quora')
index_ref = pt.IndexRef.of('./indices/beir_quora') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.quora.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

523K docs

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/quora")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/quora docs



[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/quora')
# Index beir/quora
indexer = pt.IterDictIndexer('./indices/beir_quora')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.quora')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

Citation

ir_datasets.bib:

\cite{Thakur2021Beir}

Bibtex:

@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 522931,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 15000
  }
}

`"beir/quora/dev"`

A 5,000-question subset of the original dataset, with no overlap with the other subsets.

queries

5.0K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/quora/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/quora/dev queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/quora/dev')
index_ref = pt.IndexRef.of('./indices/beir_quora') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.quora.dev.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

523K docs

Inherits docs from beir/quora

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/quora/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/quora/dev docs



[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/quora/dev')
# Index beir/quora
indexer = pt.IterDictIndexer('./indices/beir_quora')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.quora.dev')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

7.6K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/quora/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/quora/dev qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/quora/dev')
index_ref = pt.IndexRef.of('./indices/beir_quora') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.quora.dev.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Thakur2021Beir}

Bibtex:

@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 522931,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 5000
  },
  "qrels": {
    "count": 7626,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 7626
        }
      }
    }
  }
}

`"beir/quora/test"`

A 10,000-question subset of the original dataset, with no overlap with the other subsets.

queries

10K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/quora/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/quora/test queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/quora/test')
index_ref = pt.IndexRef.of('./indices/beir_quora') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.quora.test.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

523K docs

Inherits docs from beir/quora

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/quora/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/quora/test docs



[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/quora/test')
# Index beir/quora
indexer = pt.IterDictIndexer('./indices/beir_quora')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.quora.test')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

16K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/quora/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/quora/test qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.
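The exported rows can be read back into typed records with the standard library alone. The following is a minimal sketch, assuming the `--format tsv` output is tab-separated with the four columns in the order shown above; the `Qrel` class and `parse_qrels_tsv` function are illustrative names, not part of ir_datasets, and the sample rows are made up.

```python
from typing import Iterable, Iterator, NamedTuple


class Qrel(NamedTuple):
    """Local stand-in mirroring the TrecQrel fields listed above."""
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def parse_qrels_tsv(lines: Iterable[str]) -> Iterator[Qrel]:
    """Yield one Qrel per non-empty tab-separated line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        query_id, doc_id, relevance, iteration = line.split("\t")
        yield Qrel(query_id, doc_id, int(relevance), iteration)


# Made-up rows for illustration only; real IDs come from the export.
sample = ["q1\td10\t1\t0\n", "q1\td11\t1\t0\n", "\n"]
parsed = list(parse_qrels_tsv(sample))
```

Converting `relevance` to `int` at parse time keeps later comparisons (for example, `relevance >= 1`) from silently operating on strings.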

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/quora/test')
index_ref = pt.IndexRef.of('./indices/beir_quora') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.quora.test.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the relevance assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Thakur2021Beir}

Bibtex:

@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 522931,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 10000
  },
  "qrels": {
    "count": 15675,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 15675
        }
      }
    }
  }
}
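
The counts in this metadata block are enough to derive simple summary figures without downloading the dataset. A minimal sketch, assuming the JSON above has been copied into a string (trimmed here to the counts); `qrels_per_query` is a hypothetical helper, not an ir_datasets function.

```python
import json

# Metadata for beir/quora/test, copied from the block above (trimmed to the counts).
METADATA = json.loads("""
{
  "docs": {"count": 522931},
  "queries": {"count": 10000},
  "qrels": {"count": 15675, "fields": {"relevance": {"counts_by_value": {"1": 15675}}}}
}
""")


def qrels_per_query(metadata: dict) -> float:
    """Average number of judgments per query (hypothetical helper)."""
    return metadata["qrels"]["count"] / metadata["queries"]["count"]


avg = qrels_per_query(METADATA)
print(f"{avg:.2f} judged documents per query on average")
```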

`"beir/scidocs"`

A version of the SciDocs dataset, used for citation retrieval.

queries

1.0K queries

Language: en

Query type:

BeirSciQuery: (namedtuple)

query_id: str
text: str
authors: List[str]
year: int
cited_by: List[str]
references: List[str]
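
To make the shape of this record concrete, here is a minimal sketch of a local stand-in with the same field names and types. It assumes that `cited_by` and `references` hold paper identifiers; the `SciQuery` class, the `citation_neighbours` helper, and the sample record are all invented for illustration.

```python
from typing import List, NamedTuple


class SciQuery(NamedTuple):
    """Local stand-in with the same fields as BeirSciQuery."""
    query_id: str
    text: str
    authors: List[str]
    year: int
    cited_by: List[str]
    references: List[str]


def citation_neighbours(query: SciQuery) -> set:
    """IDs linked to the query paper in either citation direction."""
    return set(query.cited_by) | set(query.references)


# Invented record for illustration; not taken from the dataset.
example = SciQuery("p0", "An example paper title", ["a1"], 2015, ["p1", "p2"], ["p2", "p3"])
neighbours = citation_neighbours(example)
```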

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/scidocs")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, authors, year, cited_by, references>

You can find more details about the Python API here.

CLI

ir_datasets export beir/scidocs queries



[query_id]    [text]    [authors]    [year]    [cited_by]    [references]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scidocs')
index_ref = pt.IndexRef.of('./indices/beir_scidocs') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.scidocs.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

26K docs

Language: en

Document type:

BeirSciDoc: (namedtuple)

doc_id: str
text: str
title: str
authors: List[str]
year: int
cited_by: List[str]
references: List[str]

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/scidocs")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, authors, year, cited_by, references>

You can find more details about the Python API here.

CLI

ir_datasets export beir/scidocs docs



[doc_id]    [text]    [title]    [authors]    [year]    [cited_by]    [references]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scidocs')
# Index beir/scidocs
indexer = pt.IterDictIndexer('./indices/beir_scidocs', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.
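
The `meta={"docno": 40}` argument in the indexing example matches the `max_len` of 40 reported for `doc_id` in this dataset's metadata, while the examples for datasets with short IDs omit it. A small sketch of deriving that value follows. It assumes PyTerrier's default docno width is 20 characters, which should be checked against the installed version; `docno_meta` is a hypothetical helper.

```python
DEFAULT_DOCNO_LEN = 20  # assumed PyTerrier default; check your installed version


def docno_meta(metadata: dict, default_len: int = DEFAULT_DOCNO_LEN) -> dict:
    """Return a meta dict wide enough for the longest doc_id (hypothetical helper)."""
    max_len = metadata["docs"]["fields"]["doc_id"]["max_len"]
    return {"docno": max(max_len, default_len)}


# max_len values copied from the metadata of beir/scidocs and beir/quora.
scidocs_meta = docno_meta({"docs": {"fields": {"doc_id": {"max_len": 40}}}})
quora_meta = docno_meta({"docs": {"fields": {"doc_id": {"max_len": 6}}}})
```

The resulting dict could then be passed as the `meta` argument of `pt.IterDictIndexer`, as in the example above.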

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.scidocs')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

30K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/scidocs")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/scidocs qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/scidocs')
index_ref = pt.IndexRef.of('./indices/beir_scidocs') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.scidocs.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the relevance assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Cohan2020Scidocs,Thakur2021Beir}

Bibtex:

@inproceedings{Cohan2020Scidocs, title = "{SPECTER}: Document-level Representation Learning using Citation-informed Transformers", author = "Cohan, Arman and Feldman, Sergey and Beltagy, Iz and Downey, Doug and Weld, Daniel", booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics", month = jul, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.acl-main.207", doi = "10.18653/v1/2020.acl-main.207", pages = "2270--2282" } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 25657,
    "fields": {
      "doc_id": {
        "max_len": 40,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1000
  },
  "qrels": {
    "count": 29928,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 4928,
          "0": 25000
        }
      }
    }
  }
}

`"beir/scifact"`

A version of the SciFact dataset, for fact verification. Queries include those from the /train and /test sets.
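
The query counts reported in the Metadata blocks of the three SciFact dataset IDs are consistent with this description. A short worked check, with the counts copied by hand from those blocks:

```python
# Query counts copied from the Metadata block of each dataset ID.
QUERY_COUNTS = {
    "beir/scifact": 1109,
    "beir/scifact/test": 300,
    "beir/scifact/train": 809,
}

# The parent dataset should hold exactly the queries of its two subsets.
subset_total = QUERY_COUNTS["beir/scifact/test"] + QUERY_COUNTS["beir/scifact/train"]
print(subset_total == QUERY_COUNTS["beir/scifact"])
```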

queries

1.1K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/scifact")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/scifact queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scifact')
index_ref = pt.IndexRef.of('./indices/beir_scifact') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.scifact.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

5.2K docs

Language: en

Document type:

BeirTitleDoc: (namedtuple)

doc_id: str
text: str
title: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/scifact")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>

You can find more details about the Python API here.

CLI

ir_datasets export beir/scifact docs



[doc_id]    [text]    [title]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scifact')
# Index beir/scifact
indexer = pt.IterDictIndexer('./indices/beir_scifact')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.scifact')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

Citation

ir_datasets.bib:

\cite{Wadden2020Scifact,Thakur2021Beir}

Bibtex:

@inproceedings{Wadden2020Scifact, title = "Fact or Fiction: Verifying Scientific Claims", author = "Wadden, David and Lin, Shanchuan and Lo, Kyle and Wang, Lucy Lu and van Zuylen, Madeleine and Cohan, Arman and Hajishirzi, Hannaneh", booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)", month = nov, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.emnlp-main.609", doi = "10.18653/v1/2020.emnlp-main.609", pages = "7534--7550" } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 5183,
    "fields": {
      "doc_id": {
        "max_len": 9,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1109
  }
}

`"beir/scifact/test"`

The official dev set.

queries

300 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/scifact/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/scifact/test queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scifact/test')
index_ref = pt.IndexRef.of('./indices/beir_scifact') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.scifact.test.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

5.2K docs

Inherits docs from beir/scifact

Language: en

Document type:

BeirTitleDoc: (namedtuple)

doc_id: str
text: str
title: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/scifact/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>

You can find more details about the Python API here.

CLI

ir_datasets export beir/scifact/test docs



[doc_id]    [text]    [title]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scifact/test')
# Index beir/scifact
indexer = pt.IterDictIndexer('./indices/beir_scifact')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.scifact.test')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

339 qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/scifact/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/scifact/test qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/scifact/test')
index_ref = pt.IndexRef.of('./indices/beir_scifact') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.scifact.test.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the relevance assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Wadden2020Scifact,Thakur2021Beir}

Bibtex:

@inproceedings{Wadden2020Scifact, title = "Fact or Fiction: Verifying Scientific Claims", author = "Wadden, David and Lin, Shanchuan and Lo, Kyle and Wang, Lucy Lu and van Zuylen, Madeleine and Cohan, Arman and Hajishirzi, Hannaneh", booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)", month = nov, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.emnlp-main.609", doi = "10.18653/v1/2020.emnlp-main.609", pages = "7534--7550" } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 5183,
    "fields": {
      "doc_id": {
        "max_len": 9,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 300
  },
  "qrels": {
    "count": 339,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 339
        }
      }
    }
  }
}

`"beir/scifact/train"`

The official train set.

queries

809 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/scifact/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export beir/scifact/train queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scifact/train')
index_ref = pt.IndexRef.of('./indices/beir_scifact') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.scifact.train.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

5.2K docs

Inherits docs from beir/scifact

Language: en

Document type:

BeirTitleDoc: (namedtuple)

doc_id: str
text: str
title: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/scifact/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>

You can find more details about the Python API here.

CLI

ir_datasets export beir/scifact/train docs



[doc_id]    [text]    [title]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scifact/train')
# Index beir/scifact
indexer = pt.IterDictIndexer('./indices/beir_scifact')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.scifact.train')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

919 qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/scifact/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/scifact/train qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/scifact/train')
index_ref = pt.IndexRef.of('./indices/beir_scifact') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.scifact.train.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the relevance assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Wadden2020Scifact,Thakur2021Beir}

Bibtex:

@inproceedings{Wadden2020Scifact, title = "Fact or Fiction: Verifying Scientific Claims", author = "Wadden, David and Lin, Shanchuan and Lo, Kyle and Wang, Lucy Lu and van Zuylen, Madeleine and Cohan, Arman and Hajishirzi, Hannaneh", booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)", month = nov, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.emnlp-main.609", doi = "10.18653/v1/2020.emnlp-main.609", pages = "7534--7550" } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 5183,
    "fields": {
      "doc_id": {
        "max_len": 9,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 809
  },
  "qrels": {
    "count": 919,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 919
        }
      }
    }
  }
}

`"beir/trec-covid"`

A version of the TREC COVID (complete) dataset, with titles and abstracts as documents. Queries are the question variant.

Data pre-processing may differ from that used in cord19/trec-covid.

queries

50 queries

Language: en

Query type:

BeirCovidQuery: (namedtuple)

query_id: str
text: str
query: str
narrative: str
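
Because this query type carries several textual fields, downstream code usually has to pick one of them, much as the PyTerrier examples in this section pass `'text'` to `get_topics`. The sketch below shows one way to make that choice explicit. The `CovidQuery` class and `topic_text` helper are illustrative stand-ins, and the sample record is invented.

```python
from typing import NamedTuple


class CovidQuery(NamedTuple):
    """Local stand-in with the same fields as BeirCovidQuery."""
    query_id: str
    text: str
    query: str
    narrative: str


def topic_text(q: CovidQuery, variant: str = "text") -> str:
    """Return the chosen variant, rejecting names that are not text fields."""
    if variant not in ("text", "query", "narrative"):
        raise ValueError(f"unknown variant: {variant}")
    return getattr(q, variant)


# Invented record for illustration; not taken from the dataset.
example = CovidQuery("1", "what is an example question?", "example keywords", "An example narrative.")
chosen = topic_text(example, "query")
```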

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/trec-covid")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, query, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export beir/trec-covid queries



[query_id]    [text]    [query]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/trec-covid')
index_ref = pt.IndexRef.of('./indices/beir_trec-covid') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.trec-covid.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

171K docs

Language: en

Document type:

BeirCordDoc: (namedtuple)

doc_id: str
text: str
title: str
url: str
pubmed_id: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/trec-covid")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url, pubmed_id>

You can find more details about the Python API here.

CLI

ir_datasets export beir/trec-covid docs



[doc_id]    [text]    [title]    [url]    [pubmed_id]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/trec-covid')
# Index beir/trec-covid
indexer = pt.IterDictIndexer('./indices/beir_trec-covid')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url', 'pubmed_id'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.trec-covid')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

66K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/trec-covid")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/trec-covid qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/trec-covid')
index_ref = pt.IndexRef.of('./indices/beir_trec-covid') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.trec-covid.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the relevance assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Wang2020Cord19,Voorhees2020TrecCovid,Thakur2021Beir}

Bibtex:

@article{Wang2020Cord19, title={CORD-19: The Covid-19 Open Research Dataset}, author={Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and K. Funk and Rodney Michael Kinney and Ziyang Liu and W. Merrill and P. Mooney and D. Murdick and Devvret Rishi and Jerry Sheehan and Zhihong Shen and B. Stilson and A. Wade and K. Wang and Christopher Wilhelm and Boya Xie and D. Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier}, journal={ArXiv}, year={2020} } @article{Voorhees2020TrecCovid, title={TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection}, author={E. Voorhees and Tasmeer Alam and Steven Bedrick and Dina Demner-Fushman and W. Hersh and Kyle Lo and Kirk Roberts and I. Soboroff and Lucy Lu Wang}, journal={ArXiv}, year={2020}, volume={abs/2005.04474} } @article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }

Metadata

{
  "docs": {
    "count": 171332,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 50
  },
  "qrels": {
    "count": 66336,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 14217,
          "1": 10456,
          "0": 41661,
          "-1": 2
        }
      }
    }
  }
}
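
These judgments are graded (values from -1 to 2), so binary measures need a cutoff that decides which grades count as relevant. The sketch below counts judgments at or above a threshold using the figures from the metadata above. The choice of threshold is an assumption left to the experimenter, and `count_relevant` is a hypothetical helper.

```python
# Judgment counts for beir/trec-covid, copied from the metadata above.
COUNTS_BY_VALUE = {"2": 14217, "1": 10456, "0": 41661, "-1": 2}


def count_relevant(counts_by_value: dict, min_rel: int = 1) -> int:
    """Number of judgments with relevance >= min_rel."""
    return sum(n for value, n in counts_by_value.items() if int(value) >= min_rel)


relevant_lenient = count_relevant(COUNTS_BY_VALUE, min_rel=1)  # grades 1 and 2
relevant_strict = count_relevant(COUNTS_BY_VALUE, min_rel=2)   # grade 2 only
total = sum(COUNTS_BY_VALUE.values())
print(relevant_lenient, relevant_strict, total)
```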

`"beir/webis-touche2020"`

Original version of the Touché-2020 dataset, for argument retrieval.

Consider using beir/webis-touche2020/v2 instead; it uses an updated, more complete version of the qrels.

queries

49 queries

Language: en

Query type:

BeirToucheQuery: (namedtuple)

query_id: str
text: str
description: str
narrative: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/webis-touche2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, description, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export beir/webis-touche2020 queries



[query_id]    [text]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/webis-touche2020')
index_ref = pt.IndexRef.of('./indices/beir_webis-touche2020') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.webis-touche2020.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

383K docs

Language: en

Document type:

BeirToucheDoc: (namedtuple)

doc_id: str
text: str
title: str
stance: str
url: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/webis-touche2020")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, stance, url>

You can find more details about the Python API here.

CLI

ir_datasets export beir/webis-touche2020 docs



[doc_id]    [text]    [title]    [stance]    [url]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/webis-touche2020')
# Index beir/webis-touche2020
indexer = pt.IterDictIndexer('./indices/beir_webis-touche2020', meta={"docno": 39})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'stance', 'url'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.webis-touche2020')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

3.0K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/webis-touche2020")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export beir/webis-touche2020 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/webis-touche2020')
index_ref = pt.IndexRef.of('./indices/beir_webis-touche2020') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.webis-touche2020.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Bondarenko2020Tuche,Thakur2021Beir}

Bibtex:

@inproceedings{Bondarenko2020Tuche,
  title={Overview of Touch{\'e} 2020: Argument Retrieval},
  author={Alexander Bondarenko and Maik Fr{\"o}be and Meriem Beloucif and Lukas Gienapp and Yamen Ajjour and Alexander Panchenko and Christian Biemann and Benno Stein and Henning Wachsmuth and Martin Potthast and Matthias Hagen},
  booktitle={CLEF},
  year={2020}
}

@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal= "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

Metadata

{
  "docs": {
    "count": 382545,
    "fields": {
      "doc_id": {
        "max_len": 39,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 49
  },
  "qrels": {
    "count": 2962,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "4": 1006,
          "5": 398,
          "3": 628,
          "2": 195,
          "-2": 549,
          "1": 186
        }
      }
    }
  }
}
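The `counts_by_value` entry can be turned into a per-level share of all judgments. The sketch below copies the counts from the metadata above; the helper name `relevance_shares` is illustrative and not part of ir_datasets:

```python
# Counts copied from the "counts_by_value" metadata above.
counts_by_value = {"4": 1006, "5": 398, "3": 628, "2": 195, "-2": 549, "1": 186}


def relevance_shares(counts):
    """Map each relevance level (as int) to its percentage of all judgments."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: int(item[0]))
    return {int(level): round(100 * n / total, 1) for level, n in ordered}


shares = relevance_shares(counts_by_value)
for level, percent in shares.items():
    print(f"{level}\t{counts_by_value[str(level)]}\t{percent}%")
```

The keys are sorted numerically because the JSON stores the levels as strings, which would otherwise sort `"-2"` lexically.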

`"beir/webis-touche2020/v2"`

Version 2 of the Touché-2020 dataset, for argument retrieval. This version uses the "corrected" version of the qrels, mapped to version 1 of the corpus.

queries

49 queries

Language: en

Query type:

BeirToucheQuery: (namedtuple)

query_id: str
text: str
description: str
narrative: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/webis-touche2020/v2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, description, narrative>

You can find more details about the Python API here.

CLI

ir_datasets export beir/webis-touche2020/v2 queries



[query_id]    [text]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/webis-touche2020/v2')
index_ref = pt.IndexRef.of('./indices/beir_webis-touche2020_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

XPM-IR

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.beir.webis-touche2020.v2.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

383K docs

Language: en

Document type:

BeirToucheDoc: (namedtuple)

doc_id: str
text: str
title: str
stance: str
url: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/webis-touche2020/v2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, stance, url>

You can find more details about the Python API here.

CLI

ir_datasets export beir/webis-touche2020/v2 docs



[doc_id]    [text]    [title]    [stance]    [url]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/webis-touche2020/v2')
# Index beir/webis-touche2020/v2
indexer = pt.IterDictIndexer('./indices/beir_webis-touche2020_v2', meta={"docno": 39})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'stance', 'url'])

You can find more details about PyTerrier indexing here.

XPM-IR

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.beir.webis-touche2020.v2')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

2.2K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("beir/webis-touche2020/v2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
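Evaluation code often needs the judgments grouped by query rather than as a flat stream. As a minimal sketch, the snippet below builds such a lookup from any iterable of qrels; the `TrecQrel` tuple here is a local stand-in with the same fields as listed above, and the sample judgments are made up:

```python
from collections import defaultdict, namedtuple

# Local stand-in mirroring the TrecQrel fields listed above.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def group_qrels(qrels):
    """Build {query_id: {doc_id: relevance}} from an iterable of qrels."""
    grouped = defaultdict(dict)
    for qrel in qrels:
        grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


# Made-up judgments for illustration only.
sample = [
    TrecQrel("1", "doc-a", 2, "0"),
    TrecQrel("1", "doc-b", 0, "0"),
    TrecQrel("2", "doc-a", 1, "0"),
]
by_query = group_qrels(sample)
```

With the real dataset, `dataset.qrels_iter()` would take the place of `sample`.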

CLI

ir_datasets export beir/webis-touche2020/v2 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/webis-touche2020/v2')
index_ref = pt.IndexRef.of('./indices/beir_webis-touche2020_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.beir.webis-touche2020.v2.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Bondarenko2020Tuche,Thakur2021Beir}

Bibtex:

@inproceedings{Bondarenko2020Tuche,
  title={Overview of Touch{\'e} 2020: Argument Retrieval},
  author={Alexander Bondarenko and Maik Fr{\"o}be and Meriem Beloucif and Lukas Gienapp and Yamen Ajjour and Alexander Panchenko and Christian Biemann and Benno Stein and Henning Wachsmuth and Martin Potthast and Matthias Hagen},
  booktitle={CLEF},
  year={2020}
}

@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal= "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}

Metadata

{
  "docs": {
    "count": 382545,
    "fields": {
      "doc_id": {
        "max_len": 39,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 49
  },
  "qrels": {
    "count": 2214,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 1282,
          "1": 296,
          "2": 636
        }
      }
    }
  }
}
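One quick sanity check on metadata like the above is that the per-level counts add up to the reported number of qrels. The sketch below copies the relevant part of the JSON; the function name `qrels_counts_consistent` is illustrative and not part of ir_datasets:

```python
import json

# Metadata copied from the JSON block above (the "docs" entry is omitted).
metadata = json.loads("""
{
  "queries": {"count": 49},
  "qrels": {
    "count": 2214,
    "fields": {
      "relevance": {
        "counts_by_value": {"0": 1282, "1": 296, "2": 636}
      }
    }
  }
}
""")


def qrels_counts_consistent(meta):
    """Return True if the per-level counts sum to the reported qrels total."""
    qrels = meta["qrels"]
    by_value = qrels["fields"]["relevance"]["counts_by_value"]
    return sum(by_value.values()) == qrels["count"]


is_consistent = qrels_counts_consistent(metadata)
```

The same check applies to the metadata of the other subsets, since they share this JSON layout.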

ir_datasets: Beir (benchmark suite)

"beir"

"beir/arguana"

"beir/climate-fever"

"beir/cqadupstack/android"

"beir/cqadupstack/english"

"beir/cqadupstack/gaming"

"beir/cqadupstack/gis"

"beir/cqadupstack/mathematica"

"beir/cqadupstack/physics"

"beir/cqadupstack/programmers"

"beir/cqadupstack/stats"

"beir/cqadupstack/tex"

"beir/cqadupstack/unix"

"beir/cqadupstack/webmasters"

"beir/cqadupstack/wordpress"

"beir/dbpedia-entity"

"beir/dbpedia-entity/dev"

"beir/dbpedia-entity/test"

"beir/fever"

"beir/fever/dev"

"beir/fever/test"

"beir/fever/train"

"beir/fiqa"

"beir/fiqa/dev"

"beir/fiqa/test"

"beir/fiqa/train"

"beir/hotpotqa"

"beir/hotpotqa/dev"

"beir/hotpotqa/test"

"beir/hotpotqa/train"

"beir/msmarco"

"beir/msmarco/dev"

"beir/msmarco/test"

"beir/msmarco/train"

"beir/nfcorpus"

"beir/nfcorpus/dev"

"beir/nfcorpus/test"

"beir/nfcorpus/train"

"beir/nq"

"beir/quora"

"beir/quora/dev"

"beir/quora/test"

"beir/scidocs"

"beir/scifact"

"beir/scifact/test"

"beir/scifact/train"

"beir/trec-covid"

"beir/webis-touche2020"

"beir/webis-touche2020/v2"
