ir_datasets: Beir (benchmark suite)
Bibtex:
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
A version of the ArguAna Counterargs dataset, for argument retrieval.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/arguana")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/arguana queries
[query_id] [text]
...
You can find more details about the CLI here.
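The exported rows can be read back with the standard library. The sketch below assumes the export is tab-separated, with the columns in the order listed above ([query_id], then [text]). The sample rows and the helper name `read_queries_tsv` are made up for illustration and are not real ArguAna data.

```python
import csv
import io

# Made-up rows standing in for the output of the export command above.
SAMPLE_EXPORT = "q1\tfirst example query text\nq2\tsecond example query text\n"


def read_queries_tsv(stream):
    """Map query_id -> text from tab-separated lines (hypothetical helper)."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: row[1] for row in reader if len(row) >= 2}


queries = read_queries_tsv(io.StringIO(SAMPLE_EXPORT))
print(queries)
```

In practice the stream would be a file the export was redirected into, opened in text mode.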
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/arguana')
index_ref = pt.IndexRef.of('./indices/beir_arguana') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/arguana")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>
You can find more details about the Python API here.
ir_datasets export beir/arguana docs
[doc_id] [text] [title]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/arguana')
# Index beir/arguana
indexer = pt.IterDictIndexer('./indices/beir_arguana', meta={"docno": 47})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
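The `meta={"docno": 47}` argument above sets the space reserved for document identifiers in the index metadata, and 47 matches the longest `doc_id` reported for this dataset (`max_len`: 47). A minimal sketch of how such a value could be derived from a collection of identifiers follows. The helper name, the sample identifiers, and the assumed default length of 20 are illustrative assumptions, not part of ir_datasets or PyTerrier.

```python
def docno_meta_length(doc_ids, default_length=20):
    """Return a docno length large enough for every identifier.

    default_length is an assumed baseline; check your indexer's actual default.
    """
    longest = max((len(doc_id) for doc_id in doc_ids), default=0)
    return max(longest, default_length)


# Hypothetical identifiers, not taken from the dataset.
sample_ids = ["doc-1", "a-much-longer-hypothetical-document-identifier"]
meta = {"docno": docno_meta_length(sample_ids)}
print(meta)
```

The resulting dictionary has the same shape as the `meta` argument passed to the indexer above.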
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | | 1406 | 100% |
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/arguana")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/arguana qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/arguana')
index_ref = pt.IndexRef.of('./indices/beir_arguana') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Wachsmuth2018Arguana, author = "Wachsmuth, Henning and Syed, Shahbaz and Stein, Benno", title = "Retrieval of the Best Counterargument without Prior Topic Knowledge", booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)", year = "2018", publisher = "Association for Computational Linguistics", location = "Melbourne, Australia", pages = "241--251", url = "http://aclweb.org/anthology/P18-1023" }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 8674, "fields": { "doc_id": { "max_len": 47, "common_prefix": "" } } }, "queries": { "count": 1406 }, "qrels": { "count": 1406, "fields": { "relevance": { "counts_by_value": { "1": 1406 } } } } }
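The metadata record for beir/arguana can be used for quick sanity checks, such as the average number of judgments per query. The following sketch parses the same JSON with the standard library. The variable names are illustrative only.

```python
import json

# Metadata for beir/arguana, copied from the record above.
METADATA = """
{ "docs": { "count": 8674, "fields": { "doc_id": { "max_len": 47, "common_prefix": "" } } },
  "queries": { "count": 1406 },
  "qrels": { "count": 1406, "fields": { "relevance": { "counts_by_value": { "1": 1406 } } } } }
"""

meta = json.loads(METADATA)

# Average number of relevance judgments per query.
qrels_per_query = meta["qrels"]["count"] / meta["queries"]["count"]

# Share of judgments at each relevance level, as a percentage.
counts = meta["qrels"]["fields"]["relevance"]["counts_by_value"]
level_share = {level: 100 * n / meta["qrels"]["count"] for level, n in counts.items()}

print(f"docs={meta['docs']['count']} qrels/query={qrels_per_query:.2f} shares={level_share}")
```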
A version of the CLIMATE-FEVER dataset, for fact verification on claims about climate.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/climate-fever")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/climate-fever queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/climate-fever')
index_ref = pt.IndexRef.of('./indices/beir_climate-fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/climate-fever")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>
You can find more details about the Python API here.
ir_datasets export beir/climate-fever docs
[doc_id] [text] [title]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/climate-fever')
# Index beir/climate-fever
indexer = pt.IterDictIndexer('./indices/beir_climate-fever', meta={"docno": 221})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | | 4681 | 100% |
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/climate-fever")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/climate-fever qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
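Because a CLIMATE-FEVER claim can have several judged documents, it is often convenient to group the exported qrels by query. The sketch below assumes tab-separated rows with the four columns in the order listed above. The sample rows and the `parse_qrels` helper are invented for illustration and do not come from the dataset.

```python
from collections import defaultdict

# Made-up rows in the column order [query_id] [doc_id] [relevance] [iteration].
SAMPLE_QRELS = "q1\td1\t1\t0\nq1\td7\t1\t0\nq2\td3\t1\t0\n"


def parse_qrels(text):
    """Group qrels as {query_id: {doc_id: relevance}} (hypothetical helper)."""
    qrels = defaultdict(dict)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        query_id, doc_id, relevance, _iteration = line.split("\t")
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


grouped = parse_qrels(SAMPLE_QRELS)
print(grouped)
```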
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/climate-fever')
index_ref = pt.IndexRef.of('./indices/beir_climate-fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
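For reference, nDCG@20 rewards rankings that place relevant documents near the top of the first 20 results. The following is a simplified sketch of the measure using linear gains and a log2 rank discount. It is not the implementation PyTerrier uses, and the function names and toy ranking are invented for illustration.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (ranks start at 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_doc_ids, relevance, k=20):
    """nDCG@k for one query; relevance maps doc_id -> graded relevance."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Toy example: the only relevant document is retrieved at rank 2.
score = ndcg_at_k(["d2", "d1"], {"d1": 1}, k=20)
print(round(score, 4))
```

With a single relevant document per query, as in many of the judgments here, the score reduces to the rank discount of that document.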
Bibtex:
@article{Diggelmann2020CLIMATEFEVERAD, title={CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims}, author={T. Diggelmann and Jordan L. Boyd-Graber and Jannis Bulian and Massimiliano Ciaramita and Markus Leippold}, journal={ArXiv}, year={2020}, volume={abs/2012.00614} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 5416593, "fields": { "doc_id": { "max_len": 221, "common_prefix": "" } } }, "queries": { "count": 1535 }, "qrels": { "count": 4681, "fields": { "relevance": { "counts_by_value": { "1": 4681 } } } } }
A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the android StackExchange subforum.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/android")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/android queries
[query_id] [text] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/android')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_android') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
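CQADupStack queries carry two non-identifier fields (`text` and `tags`), so the call above names the one to use as the query string. The sketch below imitates that selection with a stand-in namedtuple. The `Query` type, the sample values, the representation of `tags` as a list of strings, and the `topics_from` helper are assumptions for illustration; PyTerrier's own `get_topics` may apply further processing to the query text.

```python
from collections import namedtuple

# Stand-in for the query records; field names follow the namedtuple shown above.
Query = namedtuple("Query", ["query_id", "text", "tags"])


def topics_from(queries, field):
    """Build qid/query rows using the chosen query field (hypothetical helper)."""
    return [{"qid": q.query_id, "query": getattr(q, field)} for q in queries]


# Made-up queries, not taken from the dataset.
sample_queries = [
    Query("1", "example question about app permissions", ["permissions"]),
    Query("2", "example question about battery usage", ["battery", "settings"]),
]
topics = topics_from(sample_queries, "text")
print(topics)
```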
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/android")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/android docs
[doc_id] [text] [title] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/android')
# Index beir/cqadupstack/android
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_android')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | | 1696 | 100% |
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/android")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/android qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/android')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_android') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
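MAP, the other measure requested in the experiment above, is the mean over queries of average precision. A simplified sketch under binary relevance follows. It is not the implementation PyTerrier uses, and the function names and the toy run and judgments are invented for illustration.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average of precision values at the ranks where relevant documents appear."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant_ids)


def mean_average_precision(run, qrels):
    """Mean of per-query average precision over all judged queries."""
    scores = [average_precision(run.get(qid, []), rel) for qid, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy run and judgments (made up).
run = {"q1": ["d3", "d1", "d2"], "q2": ["d9", "d8"]}
qrels = {"q1": {"d1", "d2"}, "q2": {"d9"}}
map_score = mean_average_precision(run, qrels)
print(round(map_score, 4))
```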
Bibtex:
@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 22998, "fields": { "doc_id": { "max_len": 5, "common_prefix": "" } } }, "queries": { "count": 699 }, "qrels": { "count": 1696, "fields": { "relevance": { "counts_by_value": { "1": 1696 } } } } }
A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the english StackExchange subforum.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/english")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/english queries
[query_id] [text] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/english')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_english') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/english")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/english docs
[doc_id] [text] [title] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/english')
# Index beir/cqadupstack/english
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_english')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | | 3765 | 100% |
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/english")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/english qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/english')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_english') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 40221, "fields": { "doc_id": { "max_len": 6, "common_prefix": "" } } }, "queries": { "count": 1570 }, "qrels": { "count": 3765, "fields": { "relevance": { "counts_by_value": { "1": 3765 } } } } }
A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the gaming StackExchange subforum.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/gaming")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/gaming queries
[query_id] [text] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/gaming')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_gaming') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/gaming")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/gaming docs
[doc_id] [text] [title] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/gaming')
# Index beir/cqadupstack/gaming
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_gaming')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | | 2263 | 100% |
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/gaming")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/gaming qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/gaming')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_gaming') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 45301, "fields": { "doc_id": { "max_len": 6, "common_prefix": "" } } }, "queries": { "count": 1595 }, "qrels": { "count": 2263, "fields": { "relevance": { "counts_by_value": { "1": 2263 } } } } }
A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the gis StackExchange subforum.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/gis")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/gis queries
[query_id] [text] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/gis')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_gis') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/gis")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/gis docs
[doc_id] [text] [title] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/gis')
# Index beir/cqadupstack/gis
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_gis')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | | 1114 | 100% |
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/gis")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/gis qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/gis')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_gis') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 37637, "fields": { "doc_id": { "max_len": 6, "common_prefix": "" } } }, "queries": { "count": 885 }, "qrels": { "count": 1114, "fields": { "relevance": { "counts_by_value": { "1": 1114 } } } } }
A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the mathematica StackExchange subforum.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/mathematica")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/mathematica queries
[query_id] [text] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/mathematica')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_mathematica') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/mathematica")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/mathematica docs
[doc_id] [text] [title] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/mathematica')
# Index beir/cqadupstack/mathematica
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_mathematica')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | | 1358 | 100% |
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/mathematica")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/mathematica qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/mathematica')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_mathematica') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 16705, "fields": { "doc_id": { "max_len": 5, "common_prefix": "" } } }, "queries": { "count": 804 }, "qrels": { "count": 1358, "fields": { "relevance": { "counts_by_value": { "1": 1358 } } } } }
A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the physics StackExchange subforum.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/physics")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/physics queries
[query_id] [text] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/physics')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_physics') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/physics")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/physics docs
[doc_id] [text] [title] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/physics')
# Index beir/cqadupstack/physics
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_physics')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | | 1933 | 100% |
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/physics")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/physics qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/physics')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_physics') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 38316, "fields": { "doc_id": { "max_len": 6, "common_prefix": "" } } }, "queries": { "count": 1039 }, "qrels": { "count": 1933, "fields": { "relevance": { "counts_by_value": { "1": 1933 } } } } }
A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the programmers StackExchange subforum.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/programmers")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/programmers queries
[query_id] [text] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/programmers')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_programmers') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/programmers")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/programmers docs
[doc_id] [text] [title] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/programmers')
# Index beir/cqadupstack/programmers
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_programmers')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | | 1675 | 100% |
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/programmers")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/programmers qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/programmers')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_programmers') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 32176, "fields": { "doc_id": { "max_len": 6, "common_prefix": "" } } }, "queries": { "count": 876 }, "qrels": { "count": 1675, "fields": { "relevance": { "counts_by_value": { "1": 1675 } } } } }
A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the stats StackExchange subforum.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/stats")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/stats queries
[query_id] [text] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/stats')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_stats') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/stats")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/stats docs
[doc_id] [text] [title] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/stats')
# Index beir/cqadupstack/stats
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_stats')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | | 913 | 100% |
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/stats")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/stats qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
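The exported file can be read back with a few lines of Python. This is a sketch under the assumption that each line holds four tab-separated columns in the order shown above; `parse_qrels_tsv` is a hypothetical helper, not an ir_datasets function, and the sample lines are made up.

```python
def parse_qrels_tsv(lines):
    """Yield (query_id, doc_id, relevance, iteration) tuples from TSV lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        query_id, doc_id, relevance, iteration = line.split("\t")
        yield query_id, doc_id, int(relevance), iteration


# Made-up sample in the assumed column order.
sample_lines = ["q1\td1\t1\t0\n", "\n", "q1\td2\t1\t0"]
for row in parse_qrels_tsv(sample_lines):
    print(row)
```

To read a real export, pass an open file handle, for example `parse_qrels_tsv(open("qrels.tsv", encoding="utf-8"))`, where the file name is whatever the CLI output was redirected to.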
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/stats')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_stats') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics('text'),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
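As a rough illustration of what the MAP measure reports on binary judgments like these, the sketch below computes average precision per query and averages it over the judged queries. It is a plain-Python approximation under the usual textbook definition, not the code PyTerrier runs; the function names and toy data are hypothetical.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average of precision values at the ranks where relevant docs appear."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(run, qrels):
    """run: query_id -> ranked doc_ids; qrels: query_id -> relevant doc_ids."""
    if not qrels:
        return 0.0
    scores = [average_precision(run.get(qid, []), rel) for qid, rel in qrels.items()]
    return sum(scores) / len(scores)


# Toy run and judgments for illustration only.
toy_run = {"q1": ["d3", "d1", "d2"], "q2": ["d9"]}
toy_qrels = {"q1": {"d1"}, "q2": {"d9"}}
print(mean_average_precision(toy_run, toy_qrels))
```

Queries that return no results contribute an average precision of zero here, which is one common convention; evaluation tools may differ in such edge cases.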
Bibtex:
@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
{ "docs": { "count": 42269, "fields": { "doc_id": { "max_len": 6, "common_prefix": "" } } }, "queries": { "count": 652 }, "qrels": { "count": 913, "fields": { "relevance": { "counts_by_value": { "1": 913 } } } } }
A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the tex StackExchange subforum.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/tex")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/tex queries
[query_id] [text] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/tex')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_tex') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/tex")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/tex docs
[doc_id] [text] [title] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/tex')
# Index beir/cqadupstack/tex
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_tex')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | | 5154 | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/tex")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/tex qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/tex')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_tex') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics('text'),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
{ "docs": { "count": 68184, "fields": { "doc_id": { "max_len": 6, "common_prefix": "" } } }, "queries": { "count": 2906 }, "qrels": { "count": 5154, "fields": { "relevance": { "counts_by_value": { "1": 5154 } } } } }
A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the unix StackExchange subforum.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/unix")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/unix queries
[query_id] [text] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/unix')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_unix') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/unix")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/unix docs
[doc_id] [text] [title] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/unix')
# Index beir/cqadupstack/unix
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_unix')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | | 1693 | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/unix")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/unix qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/unix')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_unix') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics('text'),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
{ "docs": { "count": 47382, "fields": { "doc_id": { "max_len": 6, "common_prefix": "" } } }, "queries": { "count": 1072 }, "qrels": { "count": 1693, "fields": { "relevance": { "counts_by_value": { "1": 1693 } } } } }
A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the webmasters StackExchange subforum.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/webmasters")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/webmasters queries
[query_id] [text] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/webmasters')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_webmasters') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/webmasters")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/webmasters docs
[doc_id] [text] [title] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/webmasters')
# Index beir/cqadupstack/webmasters
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_webmasters')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | | 1395 | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/webmasters")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/webmasters qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/webmasters')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_webmasters') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics('text'),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
{ "docs": { "count": 17405, "fields": { "doc_id": { "max_len": 5, "common_prefix": "" } } }, "queries": { "count": 506 }, "qrels": { "count": 1395, "fields": { "relevance": { "counts_by_value": { "1": 1395 } } } } }
A version of the CQADupStack dataset, for duplicate question retrieval. This subset is from the wordpress StackExchange subforum.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/wordpress")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/wordpress queries
[query_id] [text] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/wordpress')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_wordpress') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/wordpress")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, tags>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/wordpress docs
[doc_id] [text] [title] [tags]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/wordpress')
# Index beir/cqadupstack/wordpress
indexer = pt.IterDictIndexer('./indices/beir_cqadupstack_wordpress')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | | 744 | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/cqadupstack/wordpress")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/cqadupstack/wordpress qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/cqadupstack/wordpress')
index_ref = pt.IndexRef.of('./indices/beir_cqadupstack_wordpress') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics('text'),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Hoogeveen2015CqaDupStack, title={{CQADupStack}: A Benchmark Data Set for Community Question-Answering Research}, author={D. Hoogeveen and Karin M. Verspoor and Timothy Baldwin}, journal={Proceedings of the 20th Australasian Document Computing Symposium}, year={2015} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
{ "docs": { "count": 48605, "fields": { "doc_id": { "max_len": 6, "common_prefix": "" } } }, "queries": { "count": 541 }, "qrels": { "count": 744, "fields": { "relevance": { "counts_by_value": { "1": 744 } } } } }
A version of the DBPedia-Entity-v2 dataset for entity retrieval.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/dbpedia-entity queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity')
index_ref = pt.IndexRef.of('./indices/beir_dbpedia-entity') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url>
You can find more details about the Python API here.
ir_datasets export beir/dbpedia-entity docs
[doc_id] [text] [title] [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity')
# Index beir/dbpedia-entity
indexer = pt.IterDictIndexer('./indices/beir_dbpedia-entity', meta={"docno": 200})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])
You can find more details about PyTerrier indexing here.
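The value 200 in `meta={"docno": 200}` coincides with the `max_len` of 200 reported for `doc_id` in this dataset's metadata, which suggests the metadata field is sized to hold the longest identifier. The sketch below shows one way such a length could be measured from document records. `Doc` and `max_doc_id_length` are hypothetical names, the toy records are made up, and whether the limit is counted in characters or bytes is an assumption left open here.

```python
from collections import namedtuple

# Stand-in mirroring the documented doc fields for this dataset.
Doc = namedtuple("Doc", ["doc_id", "text", "title", "url"])


def max_doc_id_length(docs):
    """Return the length in characters of the longest doc_id (0 if empty)."""
    return max((len(doc.doc_id) for doc in docs), default=0)


# Toy records for illustration only.
toy_docs = [
    Doc("<dbpedia:A>", "text a", "A", "http://example.org/A"),
    Doc("<dbpedia:Longer_Name>", "text b", "Longer Name", "http://example.org/B"),
]
docno_length = max_doc_id_length(toy_docs)
print({"docno": docno_length})
```

Running the helper over `dataset.docs_iter()` scans the whole corpus, so for a collection of this size it may take a while.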
Bibtex:
@article{Hasibi2017DBpediaEntityVA, title={DBpedia-Entity v2: A Test Collection for Entity Search}, author={Faegheh Hasibi and Fedor Nikolaev and Chenyan Xiong and K. Balog and S. E. Bratsberg and Alexander Kotov and J. Callan}, journal={Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval}, year={2017} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
{ "docs": { "count": 4635922, "fields": { "doc_id": { "max_len": 200, "common_prefix": "" } } }, "queries": { "count": 467 } }
A random sample of 67 queries from the official test set, used as a dev set.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/dbpedia-entity/dev queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity/dev')
index_ref = pt.IndexRef.of('./indices/beir_dbpedia-entity') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from beir/dbpedia-entity
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url>
You can find more details about the Python API here.
ir_datasets export beir/dbpedia-entity/dev docs
[doc_id] [text] [title] [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity/dev')
# Index beir/dbpedia-entity
indexer = pt.IterDictIndexer('./indices/beir_dbpedia-entity', meta={"docno": 200})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | | 4268 | 75.2% |
1 | | 1024 | 18.1% |
2 | | 381 | 6.7% |
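The per-level counts and percentages can be derived from the `counts_by_value` figures in this subset's metadata (4268, 1024 and 381 judgments at levels 0, 1 and 2). The following sketch shows the arithmetic; `relevance_distribution` is a hypothetical helper and the rounding to one decimal place is an arbitrary choice.

```python
def relevance_distribution(counts_by_value):
    """Return (level, count, percent) rows sorted by relevance level."""
    total = sum(counts_by_value.values())
    rows = []
    for level in sorted(counts_by_value, key=int):
        count = counts_by_value[level]
        share = 100.0 * count / total if total else 0.0
        rows.append((int(level), count, round(share, 1)))
    return rows


# Counts copied from the beir/dbpedia-entity/dev metadata.
dev_counts = {"0": 4268, "1": 1024, "2": 381}
for level, count, share in relevance_distribution(dev_counts):
    print(f"{level}\t{count}\t{share}%")
```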
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/dbpedia-entity/dev qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity/dev')
index_ref = pt.IndexRef.of('./indices/beir_dbpedia-entity') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
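Because this subset has graded judgments (levels 0, 1 and 2), nDCG@20 rewards placing highly relevant entities near the top. The sketch below is a plain-Python illustration of one common formulation with linear gain and a log2 rank discount. It is not the code PyTerrier runs, other implementations use an exponential gain instead, and the function names and toy data are hypothetical.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (linear gain)."""
    return sum(
        gain / math.log2(rank + 1)
        for rank, gain in enumerate(gains[:k], start=1)
    )


def ndcg_at_k(ranked_doc_ids, judged, k=20):
    """judged maps doc_id -> graded relevance; unjudged docs count as 0."""
    gains = [judged.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_dcg = dcg_at_k(sorted(judged.values(), reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(gains, k) / ideal_dcg


# Toy judgments and ranking for illustration only.
toy_judged = {"a": 2, "b": 1, "c": 0}
print(ndcg_at_k(["a", "c", "b"], toy_judged))
```

Swapping the partially relevant document into second place in the toy ranking brings the score to its maximum of 1.0.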
Bibtex:
@article{Hasibi2017DBpediaEntityVA, title={DBpedia-Entity v2: A Test Collection for Entity Search}, author={Faegheh Hasibi and Fedor Nikolaev and Chenyan Xiong and K. Balog and S. E. Bratsberg and Alexander Kotov and J. Callan}, journal={Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval}, year={2017} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
{ "docs": { "count": 4635922, "fields": { "doc_id": { "max_len": 200, "common_prefix": "" } } }, "queries": { "count": 67 }, "qrels": { "count": 5673, "fields": { "relevance": { "counts_by_value": { "0": 4268, "1": 1024, "2": 381 } } } } }
The official test set, excluding the 67 queries used as a dev set.
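The metadata reports 467 queries for beir/dbpedia-entity, 67 for /dev and 400 for /test, which is consistent with the two subsets partitioning the parent query set. A small check along the following lines could confirm that on the actual query IDs. `check_split` is a hypothetical helper and the toy IDs are made up.

```python
def check_split(parent_ids, dev_ids, test_ids):
    """Report whether dev and test are disjoint and together cover parent."""
    parent, dev, test = set(parent_ids), set(dev_ids), set(test_ids)
    return {
        "disjoint": dev.isdisjoint(test),
        "covers_parent": (dev | test) == parent,
        "sizes": (len(parent), len(dev), len(test)),
    }


# Toy IDs for illustration; with ir_datasets installed the sets could be
# built as {q.query_id for q in ir_datasets.load(name).queries_iter()}.
toy_parent = ["q1", "q2", "q3", "q4", "q5"]
toy_dev = ["q2"]
toy_test = ["q1", "q3", "q4", "q5"]
print(check_split(toy_parent, toy_dev, toy_test))
```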
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/dbpedia-entity/test queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity/test')
index_ref = pt.IndexRef.of('./indices/beir_dbpedia-entity') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from beir/dbpedia-entity
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url>
You can find more details about the Python API here.
ir_datasets export beir/dbpedia-entity/test docs
[doc_id] [text] [title] [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity/test')
# Index beir/dbpedia-entity
indexer = pt.IterDictIndexer('./indices/beir_dbpedia-entity', meta={"docno": 200})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | | 28229 | 64.9% |
1 | | 8785 | 20.2% |
2 | | 6501 | 14.9% |
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/dbpedia-entity/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/dbpedia-entity/test qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/dbpedia-entity/test')
index_ref = pt.IndexRef.of('./indices/beir_dbpedia-entity') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Hasibi2017DBpediaEntityVA, title={DBpedia-Entity v2: A Test Collection for Entity Search}, author={Faegheh Hasibi and Fedor Nikolaev and Chenyan Xiong and K. Balog and S. E. Bratsberg and Alexander Kotov and J. Callan}, journal={Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval}, year={2017} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
{ "docs": { "count": 4635922, "fields": { "doc_id": { "max_len": 200, "common_prefix": "" } } }, "queries": { "count": 400 }, "qrels": { "count": 43515, "fields": { "relevance": { "counts_by_value": { "0": 28229, "1": 8785, "2": 6501 } } } } }
A version of the FEVER dataset for fact verification. Includes queries from the /train, /dev, and /test subsets.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/fever")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/fever queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever')
index_ref = pt.IndexRef.of('./indices/beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/fever")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>
You can find more details about the Python API here.
ir_datasets export beir/fever docs
[doc_id] [text] [title]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever')
# Index beir/fever
indexer = pt.IterDictIndexer('./indices/beir_fever', meta={"docno": 221})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Bibtex:
@inproceedings{Thorne2018Fever, title = "{FEVER}: a Large-scale Dataset for Fact Extraction and {VER}ification", author = "Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit", booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)", month = jun, year = "2018", address = "New Orleans, Louisiana", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/N18-1074", doi = "10.18653/v1/N18-1074", pages = "809--819" }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
{ "docs": { "count": 5416568, "fields": { "doc_id": { "max_len": 221, "common_prefix": "" } } }, "queries": { "count": 123142 } }
The official dev set.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/fever/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/fever/dev queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever/dev')
index_ref = pt.IndexRef.of('./indices/beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from beir/fever
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/fever/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>
You can find more details about the Python API here.
ir_datasets export beir/fever/dev docs
[doc_id] [text] [title]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever/dev')
# Index beir/fever
indexer = pt.IterDictIndexer('./indices/beir_fever', meta={"docno": 221})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | | 8079 | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/fever/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/fever/dev qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/fever/dev')
index_ref = pt.IndexRef.of('./indices/beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Thorne2018Fever, title = "{FEVER}: a Large-scale Dataset for Fact Extraction and {VER}ification", author = "Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit", booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)", month = jun, year = "2018", address = "New Orleans, Louisiana", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/N18-1074", doi = "10.18653/v1/N18-1074", pages = "809--819" }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
{ "docs": { "count": 5416568, "fields": { "doc_id": { "max_len": 221, "common_prefix": "" } } }, "queries": { "count": 6666 }, "qrels": { "count": 8079, "fields": { "relevance": { "counts_by_value": { "1": 8079 } } } } }
The official test set.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/fever/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/fever/test queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever/test')
index_ref = pt.IndexRef.of('./indices/beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from beir/fever
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/fever/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>
You can find more details about the Python API here.
ir_datasets export beir/fever/test docs
[doc_id] [text] [title]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever/test')
# Index beir/fever
indexer = pt.IterDictIndexer('./indices/beir_fever', meta={"docno": 221})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | | 7937 | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/fever/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/fever/test qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/fever/test')
index_ref = pt.IndexRef.of('./indices/beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Thorne2018Fever, title = "{FEVER}: a Large-scale Dataset for Fact Extraction and {VER}ification", author = "Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit", booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)", month = jun, year = "2018", address = "New Orleans, Louisiana", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/N18-1074", doi = "10.18653/v1/N18-1074", pages = "809--819" }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 5416568, "fields": { "doc_id": { "max_len": 221, "common_prefix": "" } } }, "queries": { "count": 6666 }, "qrels": { "count": 7937, "fields": { "relevance": { "counts_by_value": { "1": 7937 } } } } }
The official train set.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/fever/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/fever/train queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever/train')
index_ref = pt.IndexRef.of('./indices/beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from beir/fever
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/fever/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>
You can find more details about the Python API here.
ir_datasets export beir/fever/train docs
[doc_id] [text] [title]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fever/train')
# Index beir/fever
indexer = pt.IterDictIndexer('./indices/beir_fever', meta={"docno": 221})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/fever/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/fever/train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/fever/train')
index_ref = pt.IndexRef.of('./indices/beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Thorne2018Fever, title = "{FEVER}: a Large-scale Dataset for Fact Extraction and {VER}ification", author = "Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit", booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)", month = jun, year = "2018", address = "New Orleans, Louisiana", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/N18-1074", doi = "10.18653/v1/N18-1074", pages = "809--819" }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 5416568, "fields": { "doc_id": { "max_len": 221, "common_prefix": "" } } }, "queries": { "count": 109810 }, "qrels": { "count": 140085, "fields": { "relevance": { "counts_by_value": { "1": 140085 } } } } }
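The JSON metadata that follows the citation for beir/fever/train can be inspected directly. The sketch below parses that blob, copied verbatim from above, and derives a few summary figures. The variable names are illustrative, and the remark that `max_len` corresponds to the `meta={"docno": 221}` setting in the indexing example is an inference from the matching numbers, not something the page states.

```python
import json

# Metadata for beir/fever/train, copied from the page.
metadata = json.loads("""
{ "docs": { "count": 5416568, "fields": { "doc_id": { "max_len": 221, "common_prefix": "" } } },
  "queries": { "count": 109810 },
  "qrels": { "count": 140085, "fields": { "relevance": { "counts_by_value": { "1": 140085 } } } } }
""")

# Longest doc_id; the same value appears as the docno length in the indexing example.
docno_length = metadata["docs"]["fields"]["doc_id"]["max_len"]

# Average number of judged documents per query.
qrels_per_query = metadata["qrels"]["count"] / metadata["queries"]["count"]

# Only relevance level "1" occurs, so the judgments are effectively binary.
relevance_levels = metadata["qrels"]["fields"]["relevance"]["counts_by_value"]
is_binary = set(relevance_levels) == {"1"}

print(f"docno length: {docno_length}")
print(f"qrels per query: {qrels_per_query:.2f}")
print(f"binary relevance: {is_binary}")
```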
A version of the FIQA-2018 dataset (financial opinion question answering). Queries include those in the /train, /dev, and /test subsets.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/fiqa")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/fiqa queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa')
index_ref = pt.IndexRef.of('./indices/beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/fiqa")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export beir/fiqa docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa')
# Index beir/fiqa
indexer = pt.IterDictIndexer('./indices/beir_fiqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Bibtex:
@article{Maia2018Fiqa, title={WWW'18 Open Challenge: Financial Opinion Mining and Question Answering}, author={Macedo Maia and S. Handschuh and A. Freitas and Brian Davis and R. McDermott and M. Zarrouk and A. Balahur}, journal={Companion Proceedings of the The Web Conference 2018}, year={2018} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 57638, "fields": { "doc_id": { "max_len": 6, "common_prefix": "" } } }, "queries": { "count": 6648 } }
Random sample of 500 queries from the official dataset.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/fiqa/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/fiqa/dev queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/dev')
index_ref = pt.IndexRef.of('./indices/beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from beir/fiqa
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/fiqa/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export beir/fiqa/dev docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/dev')
# Index beir/fiqa
indexer = pt.IterDictIndexer('./indices/beir_fiqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/fiqa/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/fiqa/dev qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/dev')
index_ref = pt.IndexRef.of('./indices/beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Maia2018Fiqa, title={WWW'18 Open Challenge: Financial Opinion Mining and Question Answering}, author={Macedo Maia and S. Handschuh and A. Freitas and Brian Davis and R. McDermott and M. Zarrouk and A. Balahur}, journal={Companion Proceedings of the The Web Conference 2018}, year={2018} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 57638, "fields": { "doc_id": { "max_len": 6, "common_prefix": "" } } }, "queries": { "count": 500 }, "qrels": { "count": 1238, "fields": { "relevance": { "counts_by_value": { "1": 1238 } } } } }
Random sample of 648 queries from the official dataset.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/fiqa/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/fiqa/test queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/test')
index_ref = pt.IndexRef.of('./indices/beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from beir/fiqa
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/fiqa/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export beir/fiqa/test docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/test')
# Index beir/fiqa
indexer = pt.IterDictIndexer('./indices/beir_fiqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/fiqa/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/fiqa/test qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/test')
index_ref = pt.IndexRef.of('./indices/beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Maia2018Fiqa, title={WWW'18 Open Challenge: Financial Opinion Mining and Question Answering}, author={Macedo Maia and S. Handschuh and A. Freitas and Brian Davis and R. McDermott and M. Zarrouk and A. Balahur}, journal={Companion Proceedings of the The Web Conference 2018}, year={2018} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 57638, "fields": { "doc_id": { "max_len": 6, "common_prefix": "" } } }, "queries": { "count": 648 }, "qrels": { "count": 1706, "fields": { "relevance": { "counts_by_value": { "1": 1706 } } } } }
Official dataset without the 1148 queries sampled for /dev and /test.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/fiqa/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/fiqa/train queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/train')
index_ref = pt.IndexRef.of('./indices/beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from beir/fiqa
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/fiqa/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export beir/fiqa/train docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/train')
# Index beir/fiqa
indexer = pt.IterDictIndexer('./indices/beir_fiqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/fiqa/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/fiqa/train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/fiqa/train')
index_ref = pt.IndexRef.of('./indices/beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Maia2018Fiqa, title={WWW'18 Open Challenge: Financial Opinion Mining and Question Answering}, author={Macedo Maia and S. Handschuh and A. Freitas and Brian Davis and R. McDermott and M. Zarrouk and A. Balahur}, journal={Companion Proceedings of the The Web Conference 2018}, year={2018} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 57638, "fields": { "doc_id": { "max_len": 6, "common_prefix": "" } } }, "queries": { "count": 5500 }, "qrels": { "count": 14166, "fields": { "relevance": { "counts_by_value": { "1": 14166 } } } } }
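The beir/fiqa subset descriptions and metadata can be cross-checked with simple arithmetic. The sketch below uses only the query counts listed above for /dev, /test, /train, and the combined beir/fiqa set. The variable names are illustrative.

```python
# Query counts as listed in the metadata of each beir/fiqa subset.
split_queries = {"dev": 500, "test": 648, "train": 5500}
combined_queries = 6648  # query count listed for beir/fiqa

# Queries held out of /train for /dev and /test.
held_out = split_queries["dev"] + split_queries["test"]

# The three subsets together should account for every query in beir/fiqa.
total = sum(split_queries.values())

print(f"held out for dev and test: {held_out}")
print(f"sum of subsets: {total} (combined set lists {combined_queries})")
```

The held-out figure matches the 1148 queries mentioned in the /train description, and the subset counts add up to the combined total.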
A version of the HotpotQA dataset for multi-hop question answering. Queries include all those in /train, /dev, and /test.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/hotpotqa queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa')
index_ref = pt.IndexRef.of('./indices/beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url>
You can find more details about the Python API here.
ir_datasets export beir/hotpotqa docs
[doc_id] [text] [title] [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa')
# Index beir/hotpotqa
indexer = pt.IterDictIndexer('./indices/beir_hotpotqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])
You can find more details about PyTerrier indexing here.
Bibtex:
@inproceedings{Yang2018Hotpotqa, title = "{H}otpot{QA}: A Dataset for Diverse, Explainable Multi-hop Question Answering", author = "Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William and Salakhutdinov, Ruslan and Manning, Christopher D.", booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing", month = oct # "-" # nov, year = "2018", address = "Brussels, Belgium", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/D18-1259", doi = "10.18653/v1/D18-1259", pages = "2369--2380" }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 5233329, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 97852 } }
Random selection of 5447 queries from the official train set; these queries are excluded from /train.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/hotpotqa/dev queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/dev')
index_ref = pt.IndexRef.of('./indices/beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from beir/hotpotqa
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url>
You can find more details about the Python API here.
ir_datasets export beir/hotpotqa/dev docs
[doc_id] [text] [title] [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/dev')
# Index beir/hotpotqa
indexer = pt.IterDictIndexer('./indices/beir_hotpotqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/hotpotqa/dev qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/dev')
index_ref = pt.IndexRef.of('./indices/beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Yang2018Hotpotqa, title = "{H}otpot{QA}: A Dataset for Diverse, Explainable Multi-hop Question Answering", author = "Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William and Salakhutdinov, Ruslan and Manning, Christopher D.", booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing", month = oct # "-" # nov, year = "2018", address = "Brussels, Belgium", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/D18-1259", doi = "10.18653/v1/D18-1259", pages = "2369--2380" }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 5233329, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 5447 }, "qrels": { "count": 10894, "fields": { "relevance": { "counts_by_value": { "1": 10894 } } } } }
Official dev set from HotpotQA, here used as a test set.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/hotpotqa/test queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/test')
index_ref = pt.IndexRef.of('./indices/beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from beir/hotpotqa
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url>
You can find more details about the Python API here.
ir_datasets export beir/hotpotqa/test docs
[doc_id] [text] [title] [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/test')
# Index beir/hotpotqa
indexer = pt.IterDictIndexer('./indices/beir_hotpotqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/hotpotqa/test qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/test')
index_ref = pt.IndexRef.of('./indices/beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Yang2018Hotpotqa, title = "{H}otpot{QA}: A Dataset for Diverse, Explainable Multi-hop Question Answering", author = "Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William and Salakhutdinov, Ruslan and Manning, Christopher D.", booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing", month = oct # "-" # nov, year = "2018", address = "Brussels, Belgium", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/D18-1259", doi = "10.18653/v1/D18-1259", pages = "2369--2380" }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 5233329, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 7405 }, "qrels": { "count": 14810, "fields": { "relevance": { "counts_by_value": { "1": 14810 } } } } }
Official train set, without the 5447 randomly selected queries used for /dev.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/hotpotqa/train queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/train')
index_ref = pt.IndexRef.of('./indices/beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from beir/hotpotqa
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url>
You can find more details about the Python API here.
ir_datasets export beir/hotpotqa/train docs
[doc_id] [text] [title] [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/train')
# Index beir/hotpotqa
indexer = pt.IterDictIndexer('./indices/beir_hotpotqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/hotpotqa/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/hotpotqa/train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/hotpotqa/train')
index_ref = pt.IndexRef.of('./indices/beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Yang2018Hotpotqa, title = "{H}otpot{QA}: A Dataset for Diverse, Explainable Multi-hop Question Answering", author = "Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William and Salakhutdinov, Ruslan and Manning, Christopher D.", booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing", month = oct # "-" # nov, year = "2018", address = "Brussels, Belgium", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/D18-1259", doi = "10.18653/v1/D18-1259", pages = "2369--2380" }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 5233329, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 85000 }, "qrels": { "count": 170000, "fields": { "relevance": { "counts_by_value": { "1": 170000 } } } } }
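The beir/hotpotqa metadata shows a regular pattern: in every subset the qrels count is twice the query count. The sketch below works this out from the (queries, qrels) pairs listed above. An average of two judged documents per query is consistent with two supporting documents per multi-hop question, though the counts alone do not prove that every query has exactly two.

```python
# (query count, qrels count) per subset, as listed in the metadata above.
subsets = {
    "dev": (5447, 10894),
    "test": (7405, 14810),
    "train": (85000, 170000),
}

# Average number of relevant documents per query in each subset.
qrels_per_query = {
    name: qrels / queries for name, (queries, qrels) in subsets.items()
}

# Subset query counts compared against the combined beir/hotpotqa set.
total_queries = sum(queries for queries, _ in subsets.values())

for name, ratio in qrels_per_query.items():
    print(f"{name}: {ratio:.1f} qrels per query")
print(f"total queries: {total_queries}")
```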
A version of the MS MARCO passage ranking dataset. Includes queries from the /train, /dev, and /test sub-datasets.
Note that this version differs from msmarco-passage in that it does not correct the encoding problems in the source documents.
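The note above does not say which encoding problems are meant. One common kind in web-derived text is mojibake, where UTF-8 bytes were decoded with a single-byte codec. The sketch below illustrates that failure mode and a round-trip repair under that assumption. It is not the correction msmarco-passage applies, and the `repair_mojibake` helper is hypothetical.

```python
def repair_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-Latin-1 mix-up, leaving other text untouched.

    Hypothetical helper for illustration only.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Text was not garbled this way, so return it unchanged.
        return text


original = "café"
# Simulate the problem: UTF-8 bytes interpreted as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")
repaired = repair_mojibake(garbled)

print(garbled)
print(repaired)
```

When comparing retrieval results between beir/msmarco and msmarco-passage, differences of this kind in the document text may lead to small differences in tokenization and scores.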
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/msmarco")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/msmarco queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco')
index_ref = pt.IndexRef.of('./indices/beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/msmarco")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export beir/msmarco docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco')
# Index beir/msmarco
indexer = pt.IterDictIndexer('./indices/beir_msmarco')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 509962 } }
A version of the MS MARCO passage ranking dev set.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/msmarco/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/msmarco/dev queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/dev')
index_ref = pt.IndexRef.of('./indices/beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from beir/msmarco
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/msmarco/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export beir/msmarco/dev docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/dev')
# Index beir/msmarco
indexer = pt.IterDictIndexer('./indices/beir_msmarco')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/msmarco/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/msmarco/dev qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
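The TSV export above can be read back with the standard library alone. The following is a minimal sketch, assuming the output has the four tab-separated columns shown above and no header row. The `parse_qrels_tsv` helper and the sample rows are hypothetical and are not part of ir_datasets.

```python
import csv
import io
from collections import defaultdict


def parse_qrels_tsv(stream):
    """Read rows of query_id, doc_id, relevance, iteration into a nested dict."""
    qrels = defaultdict(dict)
    for query_id, doc_id, relevance, _iteration in csv.reader(stream, delimiter="\t"):
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


# Hypothetical sample rows standing in for the real export output.
sample = io.StringIO("q1\td10\t1\t0\nq1\td11\t1\t0\nq2\td20\t1\t0\n")
qrels = parse_qrels_tsv(sample)
print(qrels)
```

To use it on real data, redirect the export command to a file and pass the opened file object instead of the `io.StringIO` sample.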
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/dev')
index_ref = pt.IndexRef.of('./indices/beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
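To make the `nDCG@20` measure in the experiment above more concrete, here is a minimal pure-Python sketch of one common formulation (linear gain with a log2 rank discount). It is an illustration only: the implementation PyTerrier actually uses may differ in details, and the `ndcg_at_k` helper and sample judgments are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a list of gains in rank order."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_doc_ids, judged, k=20):
    """judged maps doc_id -> relevance; unjudged documents count as 0."""
    gains = [judged.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(judged.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments for a single query.
judged = {"a": 3, "b": 1, "c": 0}
perfect = ndcg_at_k(["a", "b", "c"], judged)
swapped = ndcg_at_k(["b", "a", "c"], judged)
print(perfect, swapped)
```

A ranking that places documents in descending order of relevance scores 1.0; any other order scores lower.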
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 6980 }, "qrels": { "count": 7437, "fields": { "relevance": { "counts_by_value": { "1": 7437 } } } } }
A version of the TREC Deep Learning 2019 set.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/msmarco/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/msmarco/test queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/test')
index_ref = pt.IndexRef.of('./indices/beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from beir/msmarco
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/msmarco/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export beir/msmarco/test docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/test')
# Index beir/msmarco
indexer = pt.IterDictIndexer('./indices/beir_msmarco')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/msmarco/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/msmarco/test qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/test')
index_ref = pt.IndexRef.of('./indices/beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
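MAP is defined over binary relevance, while the qrels for this subset are graded (levels 0 to 3 in its metadata), so the result depends on where the relevant/non-relevant cut-off is placed. The sketch below computes average precision for one query with an explicit threshold. The `average_precision` helper, the `min_rel` parameter, and the sample judgments are hypothetical, and the threshold PyTerrier applies by default is not stated here.

```python
def average_precision(ranked_doc_ids, judged, min_rel=1):
    """Average precision for one query; a doc is relevant if its label >= min_rel."""
    relevant = {doc_id for doc_id, rel in judged.items() if rel >= min_rel}
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# Hypothetical graded judgments and ranking for a single query.
judged = {"a": 3, "b": 1, "c": 0}
ranking = ["b", "c", "a"]
ap_lenient = average_precision(ranking, judged, min_rel=1)
ap_strict = average_precision(ranking, judged, min_rel=2)
print(ap_lenient, ap_strict)
```

The same ranking scores differently under the two thresholds, which is why the cut-off should be reported alongside any MAP figure on graded qrels.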
Bibtex:
@inproceedings{Craswell2019TrecDl, title={Overview of the TREC 2019 deep learning track}, author={Nick Craswell and Bhaskar Mitra and Emine Yilmaz and Daniel Campos and Ellen Voorhees}, booktitle={TREC 2019}, year={2019} }
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 43 }, "qrels": { "count": 9260, "fields": { "relevance": { "counts_by_value": { "0": 5158, "1": 1601, "2": 1804, "3": 697 } } } } }
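The `counts_by_value` figures in the metadata above are enough to work out the share of each relevance level for beir/msmarco/test. A minimal sketch using only the numbers reported there; the variable names are illustrative.

```python
# Counts copied from the beir/msmarco/test metadata.
counts_by_value = {"0": 5158, "1": 1601, "2": 1804, "3": 697}

total = sum(counts_by_value.values())
shares = {
    level: 100 * count / total
    for level, count in counts_by_value.items()
}

# Print one row per relevance level: level, count, percentage of all qrels.
for level in sorted(counts_by_value, key=int):
    print(f"{level}\t{counts_by_value[level]}\t{shares[level]:.1f}%")
```

The total equals the reported qrels count of 9260, so the four levels account for every judgment in the subset.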
A version of the MS MARCO passage ranking train set.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/msmarco/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/msmarco/train queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/train')
index_ref = pt.IndexRef.of('./indices/beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from beir/msmarco
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/msmarco/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export beir/msmarco/train docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/train')
# Index beir/msmarco
indexer = pt.IterDictIndexer('./indices/beir_msmarco')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/msmarco/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/msmarco/train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/msmarco/train')
index_ref = pt.IndexRef.of('./indices/beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Bajaj2016Msmarco, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang}, booktitle={InCoCo@NIPS}, year={2016} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 8841823, "fields": { "doc_id": { "max_len": 7, "common_prefix": "" } } }, "queries": { "count": 502939 }, "qrels": { "count": 532751, "fields": { "relevance": { "counts_by_value": { "1": 532751 } } } } }
A version of the NF Corpus (Nutrition Facts). Queries use the "title" variant, which here is often a natural-language question. Queries include all those from /train, /dev, and /test.
Data pre-processing may differ from what is done in nfcorpus.
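The query counts reported in the metadata on this page are consistent with the combined set being the union of the three splits. A quick arithmetic check, using only the counts listed for beir/nfcorpus and its /train, /dev, and /test subsets; the variable names are illustrative.

```python
# Query counts copied from the metadata of each beir/nfcorpus subset.
split_query_counts = {"train": 2590, "dev": 324, "test": 323}

# Query count reported for the combined beir/nfcorpus dataset.
combined_query_count = 3237

total = sum(split_query_counts.values())
print(total, total == combined_query_count)
```

The matching totals suggest the splits do not share queries, although the counts alone cannot prove that.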
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, url>
You can find more details about the Python API here.
ir_datasets export beir/nfcorpus queries
[query_id] [text] [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus')
index_ref = pt.IndexRef.of('./indices/beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url>
You can find more details about the Python API here.
ir_datasets export beir/nfcorpus docs
[doc_id] [text] [title] [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus')
# Index beir/nfcorpus
indexer = pt.IterDictIndexer('./indices/beir_nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])
You can find more details about PyTerrier indexing here.
Bibtex:
@inproceedings{Boteva2016Nfcorpus, title="A Full-Text Learning to Rank Dataset for Medical Information Retrieval", author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler", booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})", location = "Padova, Italy", publisher = "Springer", year = 2016 }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 3633, "fields": { "doc_id": { "max_len": 8, "common_prefix": "MED-" } } }, "queries": { "count": 3237 } }
Combined dev set of NFCorpus.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/nfcorpus/dev queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/dev')
index_ref = pt.IndexRef.of('./indices/beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from beir/nfcorpus
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url>
You can find more details about the Python API here.
ir_datasets export beir/nfcorpus/dev docs
[doc_id] [text] [title] [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/dev')
# Index beir/nfcorpus
indexer = pt.IterDictIndexer('./indices/beir_nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/nfcorpus/dev qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/dev')
index_ref = pt.IndexRef.of('./indices/beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Boteva2016Nfcorpus, title="A Full-Text Learning to Rank Dataset for Medical Information Retrieval", author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler", booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})", location = "Padova, Italy", publisher = "Springer", year = 2016 }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 3633, "fields": { "doc_id": { "max_len": 8, "common_prefix": "MED-" } } }, "queries": { "count": 324 }, "qrels": { "count": 11385, "fields": { "relevance": { "counts_by_value": { "2": 521, "1": 10864 } } } } }
Combined test set of NFCorpus.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/nfcorpus/test queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/test')
index_ref = pt.IndexRef.of('./indices/beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from beir/nfcorpus
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url>
You can find more details about the Python API here.
ir_datasets export beir/nfcorpus/test docs
[doc_id] [text] [title] [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/test')
# Index beir/nfcorpus
indexer = pt.IterDictIndexer('./indices/beir_nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/nfcorpus/test qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/test')
index_ref = pt.IndexRef.of('./indices/beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Boteva2016Nfcorpus, title="A Full-Text Learning to Rank Dataset for Medical Information Retrieval", author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler", booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})", location = "Padova, Italy", publisher = "Springer", year = 2016 }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 3633, "fields": { "doc_id": { "max_len": 8, "common_prefix": "MED-" } } }, "queries": { "count": 323 }, "qrels": { "count": 12334, "fields": { "relevance": { "counts_by_value": { "2": 576, "1": 11758 } } } } }
Combined train set of NFCorpus.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/nfcorpus/train queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/train')
index_ref = pt.IndexRef.of('./indices/beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from beir/nfcorpus
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title, url>
You can find more details about the Python API here.
ir_datasets export beir/nfcorpus/train docs
[doc_id] [text] [title] [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/train')
# Index beir/nfcorpus
indexer = pt.IterDictIndexer('./indices/beir_nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/nfcorpus/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/nfcorpus/train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/nfcorpus/train')
index_ref = pt.IndexRef.of('./indices/beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Boteva2016Nfcorpus, title="A Full-Text Learning to Rank Dataset for Medical Information Retrieval", author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler", booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})", location = "Padova, Italy", publisher = "Springer", year = 2016 }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 3633, "fields": { "doc_id": { "max_len": 8, "common_prefix": "MED-" } } }, "queries": { "count": 2590 }, "qrels": { "count": 110575, "fields": { "relevance": { "counts_by_value": { "1": 110575 } } } } }
A version of the Natural Questions dev dataset.
Data pre-processing differs from that of both natural-questions and dpr-w100/natural-questions, especially in the document collection and in the filtering applied to the queries. See the BEIR paper for details.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/nq")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/nq queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nq')
index_ref = pt.IndexRef.of('./indices/beir_nq') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/nq")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>
You can find more details about the Python API here.
ir_datasets export beir/nq docs
[doc_id] [text] [title]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/nq')
# Index beir/nq
indexer = pt.IterDictIndexer('./indices/beir_nq')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/nq")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/nq qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/nq')
index_ref = pt.IndexRef.of('./indices/beir_nq') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Kwiatkowski2019Nq, title = {Natural Questions: a Benchmark for Question Answering Research}, author = {Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Matthew Kelcey and Jacob Devlin and Kenton Lee and Kristina N. Toutanova and Llion Jones and Ming-Wei Chang and Andrew Dai and Jakob Uszkoreit and Quoc Le and Slav Petrov}, year = {2019}, journal = {TACL} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 2681468, "fields": { "doc_id": { "max_len": 10, "common_prefix": "doc" } } }, "queries": { "count": 3452 }, "qrels": { "count": 4201, "fields": { "relevance": { "counts_by_value": { "1": 4201 } } } } }
A version of the Quora duplicate question detection dataset (QQP). Includes queries from the /dev and /test sets.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/quora")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/quora queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/quora')
index_ref = pt.IndexRef.of('./indices/beir_quora') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/quora")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export beir/quora docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/quora')
# Index beir/quora
indexer = pt.IterDictIndexer('./indices/beir_quora')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Bibtex:
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 522931, "fields": { "doc_id": { "max_len": 6, "common_prefix": "" } } }, "queries": { "count": 15000 } }
A 5,000-question subset of the original dataset that does not overlap with the other subsets.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/quora/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/quora/dev queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/quora/dev')
index_ref = pt.IndexRef.of('./indices/beir_quora') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from beir/quora
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/quora/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export beir/quora/dev docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/quora/dev')
# Index beir/quora
indexer = pt.IterDictIndexer('./indices/beir_quora')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/quora/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/quora/dev qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/quora/dev')
index_ref = pt.IndexRef.of('./indices/beir_quora') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 522931, "fields": { "doc_id": { "max_len": 6, "common_prefix": "" } } }, "queries": { "count": 5000 }, "qrels": { "count": 7626, "fields": { "relevance": { "counts_by_value": { "1": 7626 } } } } }
A 10,000-question subset of the original dataset that does not overlap with the other subsets.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/quora/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/quora/test queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/quora/test')
index_ref = pt.IndexRef.of('./indices/beir_quora') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from beir/quora
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/quora/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export beir/quora/test docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/quora/test')
# Index beir/quora
indexer = pt.IterDictIndexer('./indices/beir_quora')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/quora/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/quora/test qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/quora/test')
index_ref = pt.IndexRef.of('./indices/beir_quora') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 522931, "fields": { "doc_id": { "max_len": 6, "common_prefix": "" } } }, "queries": { "count": 10000 }, "qrels": { "count": 15675, "fields": { "relevance": { "counts_by_value": { "1": 15675 } } } } }
A version of the SciDocs dataset, used for citation retrieval.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/scidocs")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text, authors, year, cited_by, references>
You can find more details about the Python API here.
ir_datasets export beir/scidocs queries
[query_id] [text] [authors] [year] [cited_by] [references]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scidocs')
index_ref = pt.IndexRef.of('./indices/beir_scidocs') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/scidocs")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, text, title, authors, year, cited_by, references>
You can find more details about the Python API here.
ir_datasets export beir/scidocs docs
[doc_id] [text] [title] [authors] [year] [cited_by] [references]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scidocs')
# Index beir/scidocs
indexer = pt.IterDictIndexer('./indices/beir_scidocs', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
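The `meta={"docno": 40}` argument in the indexing example matches the maximum doc_id length of 40 listed in this dataset's metadata. As a minimal sketch, the helper below computes that length from any iterable of document records, so the metadata field can be sized before indexing. The name `max_doc_id_len` and the sample records are hypothetical, not part of ir_datasets or PyTerrier.

```python
from collections import namedtuple
from typing import Iterable


def max_doc_id_len(docs: Iterable) -> int:
    """Return the length of the longest doc_id among the given records."""
    # default=0 keeps the helper well-defined for an empty corpus.
    return max((len(doc.doc_id) for doc in docs), default=0)


# Tiny stand-in records (made up for illustration); real ones would come
# from ir_datasets.load("beir/scidocs").docs_iter().
Doc = namedtuple("Doc", ["doc_id", "text"])
sample = [Doc("a1", "first"), Doc("b2345", "second")]
docno_len = max_doc_id_len(sample)
print(docno_len)  # 5
```

With the real corpus, the result could be passed as `meta={"docno": docno_len}` when constructing `pt.IterDictIndexer`.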
Relevance levels
Rel. | Definition | Count | % |
---|
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/scidocs")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/scidocs qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/scidocs')
index_ref = pt.IndexRef.of('./indices/beir_scidocs') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Cohan2020Scidocs, title = "{SPECTER}: Document-level Representation Learning using Citation-informed Transformers", author = "Cohan, Arman and Feldman, Sergey and Beltagy, Iz and Downey, Doug and Weld, Daniel", booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics", month = jul, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.acl-main.207", doi = "10.18653/v1/2020.acl-main.207", pages = "2270--2282" }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 25657, "fields": { "doc_id": { "max_len": 40, "common_prefix": "" } } }, "queries": { "count": 1000 }, "qrels": { "count": 29928, "fields": { "relevance": { "counts_by_value": { "1": 4928, "0": 25000 } } } } }
A version of the SciFact dataset, for fact verification. Queries include those from the /train and /test sets.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/scifact")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/scifact queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scifact')
index_ref = pt.IndexRef.of('./indices/beir_scifact') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
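The description of beir/scifact says its queries include those from the /train and /test subsets. A minimal sketch of how that relationship could be checked is shown below. The helpers `check_union` and `query_ids` are hypothetical names introduced here, and the sample IDs are made up; only `ir_datasets.load` and `queries_iter` come from the examples in this document.

```python
from typing import Iterable, List


def check_union(parent_ids: Iterable[str], *split_ids: Iterable[str]) -> dict:
    """Compare a parent set of query IDs against the union of its splits."""
    parent = set(parent_ids)
    splits = [set(ids) for ids in split_ids]
    union = set().union(*splits)
    return {
        "parent": len(parent),
        "union": len(union),
        # IDs that appear in more than one split are counted as overlap.
        "overlap": sum(len(s) for s in splits) - len(union),
        "missing_from_splits": len(parent - union),
        "extra_in_splits": len(union - parent),
    }


def query_ids(dataset_id: str) -> List[str]:
    """Collect query IDs for a dataset (downloads data on first use)."""
    import ir_datasets  # imported lazily so check_union works without it
    return [q.query_id for q in ir_datasets.load(dataset_id).queries_iter()]


# Made-up IDs for illustration only.
report = check_union(["1", "2", "3"], ["1"], ["2", "3"])
print(report)
```

With the data available, `check_union(query_ids("beir/scifact"), query_ids("beir/scifact/train"), query_ids("beir/scifact/test"))` would show whether the parent set is exactly the union of the two splits.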
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/scifact")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, text, title>
You can find more details about the Python API here.
ir_datasets export beir/scifact docs
[doc_id] [text] [title]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scifact')
# Index beir/scifact
indexer = pt.IterDictIndexer('./indices/beir_scifact')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Bibtex:
@inproceedings{Wadden2020Scifact, title = "Fact or Fiction: Verifying Scientific Claims", author = "Wadden, David and Lin, Shanchuan and Lo, Kyle and Wang, Lucy Lu and van Zuylen, Madeleine and Cohan, Arman and Hajishirzi, Hannaneh", booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)", month = nov, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.emnlp-main.609", doi = "10.18653/v1/2020.emnlp-main.609", pages = "7534--7550" }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 5183, "fields": { "doc_id": { "max_len": 9, "common_prefix": "" } } }, "queries": { "count": 1109 } }
The official dev set.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/scifact/test")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/scifact/test queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scifact/test')
index_ref = pt.IndexRef.of('./indices/beir_scifact') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from beir/scifact
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/scifact/test")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, text, title>
You can find more details about the Python API here.
ir_datasets export beir/scifact/test docs
[doc_id] [text] [title]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scifact/test')
# Index beir/scifact
indexer = pt.IterDictIndexer('./indices/beir_scifact')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/scifact/test")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/scifact/test qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/scifact/test')
index_ref = pt.IndexRef.of('./indices/beir_scifact') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Wadden2020Scifact, title = "Fact or Fiction: Verifying Scientific Claims", author = "Wadden, David and Lin, Shanchuan and Lo, Kyle and Wang, Lucy Lu and van Zuylen, Madeleine and Cohan, Arman and Hajishirzi, Hannaneh", booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)", month = nov, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.emnlp-main.609", doi = "10.18653/v1/2020.emnlp-main.609", pages = "7534--7550" }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 5183, "fields": { "doc_id": { "max_len": 9, "common_prefix": "" } } }, "queries": { "count": 300 }, "qrels": { "count": 339, "fields": { "relevance": { "counts_by_value": { "1": 339 } } } } }
The official train set.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/scifact/train")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export beir/scifact/train queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scifact/train')
index_ref = pt.IndexRef.of('./indices/beir_scifact') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from beir/scifact
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/scifact/train")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, text, title>
You can find more details about the Python API here.
ir_datasets export beir/scifact/train docs
[doc_id] [text] [title]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/scifact/train')
# Index beir/scifact
indexer = pt.IterDictIndexer('./indices/beir_scifact')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/scifact/train")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/scifact/train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/scifact/train')
index_ref = pt.IndexRef.of('./indices/beir_scifact') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Wadden2020Scifact, title = "Fact or Fiction: Verifying Scientific Claims", author = "Wadden, David and Lin, Shanchuan and Lo, Kyle and Wang, Lucy Lu and van Zuylen, Madeleine and Cohan, Arman and Hajishirzi, Hannaneh", booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)", month = nov, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.emnlp-main.609", doi = "10.18653/v1/2020.emnlp-main.609", pages = "7534--7550" }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 5183, "fields": { "doc_id": { "max_len": 9, "common_prefix": "" } } }, "queries": { "count": 809 }, "qrels": { "count": 919, "fields": { "relevance": { "counts_by_value": { "1": 919 } } } } }
A version of the TREC COVID (complete) dataset, with titles and abstracts as documents. Queries are the question variant.
Data pre-processing may differ from what is done in cord19/trec-covid.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/trec-covid")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text, query, narrative>
You can find more details about the Python API here.
ir_datasets export beir/trec-covid queries
[query_id] [text] [query] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/trec-covid')
index_ref = pt.IndexRef.of('./indices/beir_trec-covid') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/trec-covid")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, text, title, url, pubmed_id>
You can find more details about the Python API here.
ir_datasets export beir/trec-covid docs
[doc_id] [text] [title] [url] [pubmed_id]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/trec-covid')
# Index beir/trec-covid
indexer = pt.IterDictIndexer('./indices/beir_trec-covid')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url', 'pubmed_id'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/trec-covid")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/trec-covid qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/trec-covid')
index_ref = pt.IndexRef.of('./indices/beir_trec-covid') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
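Each qrel carries an integer relevance value, and this dataset's metadata lists four distinct values (2, 1, 0 and -1). As a minimal sketch, the distribution can be tallied from any iterable of qrel records as shown below. The function name `relevance_histogram` and the sample judgments are hypothetical and made up for illustration; they are not real TREC-COVID data.

```python
from collections import Counter, namedtuple
from typing import Iterable, List, Tuple


def relevance_histogram(qrels: Iterable) -> List[Tuple[int, int, float]]:
    """Return (relevance, count, percent) rows, highest relevance first."""
    counts = Counter(q.relevance for q in qrels)
    total = sum(counts.values())
    return [
        (rel, n, 100.0 * n / total)
        for rel, n in sorted(counts.items(), reverse=True)
    ]


# Stand-in records with the same fields as the qrels namedtuple above.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])
sample = [
    Qrel("1", "d1", 2, "0"),
    Qrel("1", "d2", 0, "0"),
    Qrel("2", "d3", 0, "0"),
    Qrel("2", "d4", -1, "0"),
]
rows = relevance_histogram(sample)
for rel, n, pct in rows:
    print(f"{rel:>3} | {n} | {pct:.1f}%")
```

Passing `ir_datasets.load("beir/trec-covid").qrels_iter()` instead of `sample` would produce the same kind of table for the real judgments.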
Bibtex:
@article{Wang2020Cord19, title={CORD-19: The Covid-19 Open Research Dataset}, author={Lucy Lu Wang and Kyle Lo and Yoganand Chandrasekhar and Russell Reas and Jiangjiang Yang and Darrin Eide and K. Funk and Rodney Michael Kinney and Ziyang Liu and W. Merrill and P. Mooney and D. Murdick and Devvret Rishi and Jerry Sheehan and Zhihong Shen and B. Stilson and A. Wade and K. Wang and Christopher Wilhelm and Boya Xie and D. Raymond and Daniel S. Weld and Oren Etzioni and Sebastian Kohlmeier}, journal={ArXiv}, year={2020} }
@article{Voorhees2020TrecCovid, title={TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection}, author={E. Voorhees and Tasmeer Alam and Steven Bedrick and Dina Demner-Fushman and W. Hersh and Kyle Lo and Kirk Roberts and I. Soboroff and Lucy Lu Wang}, journal={ArXiv}, year={2020}, volume={abs/2005.04474} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 171332, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 50 }, "qrels": { "count": 66336, "fields": { "relevance": { "counts_by_value": { "2": 14217, "1": 10456, "0": 41661, "-1": 2 } } } } }
Original version of the Touché-2020 dataset, for argument retrieval.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/webis-touche2020")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text, description, narrative>
You can find more details about the Python API here.
ir_datasets export beir/webis-touche2020 queries
[query_id] [text] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/webis-touche2020')
index_ref = pt.IndexRef.of('./indices/beir_webis-touche2020') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/webis-touche2020")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, text, title, stance, url>
You can find more details about the Python API here.
ir_datasets export beir/webis-touche2020 docs
[doc_id] [text] [title] [stance] [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/webis-touche2020')
# Index beir/webis-touche2020
indexer = pt.IterDictIndexer('./indices/beir_webis-touche2020', meta={"docno": 39})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'stance', 'url'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/webis-touche2020")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/webis-touche2020 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/webis-touche2020')
index_ref = pt.IndexRef.of('./indices/beir_webis-touche2020') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Bondarenko2020Tuche, title={Overview of Touch{\'e} 2020: Argument Retrieval}, author={Alexander Bondarenko and Maik Fr{\"o}be and Meriem Beloucif and Lukas Gienapp and Yamen Ajjour and Alexander Panchenko and Christian Biemann and Benno Stein and Henning Wachsmuth and Martin Potthast and Matthias Hagen}, booktitle={CLEF}, year={2020} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 382545, "fields": { "doc_id": { "max_len": 39, "common_prefix": "" } } }, "queries": { "count": 49 }, "qrels": { "count": 2962, "fields": { "relevance": { "counts_by_value": { "4": 1006, "5": 398, "3": 628, "2": 195, "-2": 549, "1": 186 } } } } }
Version 2 of the Touché-2020 dataset, for argument retrieval. This version uses the "corrected" version of the qrels, mapped to version 1 of the corpus.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/webis-touche2020/v2")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text, description, narrative>
You can find more details about the Python API here.
ir_datasets export beir/webis-touche2020/v2 queries
[query_id] [text] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/webis-touche2020/v2')
index_ref = pt.IndexRef.of('./indices/beir_webis-touche2020_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/webis-touche2020/v2")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, text, title, stance, url>
You can find more details about the Python API here.
ir_datasets export beir/webis-touche2020/v2 docs
[doc_id] [text] [title] [stance] [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:beir/webis-touche2020/v2')
# Index beir/webis-touche2020/v2
indexer = pt.IterDictIndexer('./indices/beir_webis-touche2020_v2', meta={"docno": 39})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'stance', 'url'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|
Examples:
import ir_datasets
dataset = ir_datasets.load("beir/webis-touche2020/v2")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export beir/webis-touche2020/v2 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:beir/webis-touche2020/v2')
index_ref = pt.IndexRef.of('./indices/beir_webis-touche2020_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
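Since v2 uses corrected qrels over the same version 1 corpus, one way to see what the correction changed is to cross-tabulate the labels of query-document pairs judged in both versions. The sketch below assumes only that both qrels share the `query_id`, `doc_id` and `relevance` fields shown above. The helper name `label_shift` is hypothetical and the sample judgments are made up, so they say nothing about the real label mapping.

```python
from collections import Counter, namedtuple
from typing import Iterable


def label_shift(old_qrels: Iterable, new_qrels: Iterable) -> Counter:
    """Count (old_relevance, new_relevance) pairs for judgments in both sets."""
    old = {(q.query_id, q.doc_id): q.relevance for q in old_qrels}
    shift = Counter()
    for q in new_qrels:
        key = (q.query_id, q.doc_id)
        if key in old:  # pairs judged in only one version are skipped
            shift[(old[key], q.relevance)] += 1
    return shift


# Stand-in records; values are invented for illustration.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])
v1 = [Qrel("1", "a", 4, "0"), Qrel("1", "b", -2, "0"), Qrel("2", "c", 5, "0")]
v2 = [Qrel("1", "a", 2, "0"), Qrel("1", "b", 0, "0"), Qrel("2", "d", 1, "0")]
shift = label_shift(v1, v2)
for (old_rel, new_rel), n in sorted(shift.items()):
    print(f"{old_rel} -> {new_rel}: {n}")
```

With real data, `v1` and `v2` would be the `qrels_iter()` results of `beir/webis-touche2020` and `beir/webis-touche2020/v2` respectively.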
Bibtex:
@inproceedings{Bondarenko2020Tuche, title={Overview of Touch{\'e} 2020: Argument Retrieval}, author={Alexander Bondarenko and Maik Fr{\"o}be and Meriem Beloucif and Lukas Gienapp and Yamen Ajjour and Alexander Panchenko and Christian Biemann and Benno Stein and Henning Wachsmuth and Martin Potthast and Matthias Hagen}, booktitle={CLEF}, year={2020} }
@article{Thakur2021Beir, title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models", author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna", journal= "arXiv preprint arXiv:2104.08663", month = "4", year = "2021", url = "https://arxiv.org/abs/2104.08663", }
Metadata:
{ "docs": { "count": 382545, "fields": { "doc_id": { "max_len": 39, "common_prefix": "" } } }, "queries": { "count": 49 }, "qrels": { "count": 2214, "fields": { "relevance": { "counts_by_value": { "0": 1282, "1": 296, "2": 636 } } } } }