ir_datasets: Nano BEIR (benchmark suite)
Nano BEIR is a smaller version of the BEIR suite of benchmarks (at most 50 queries per benchmark) for testing zero-shot transfer.
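This page documents thirteen Nano BEIR benchmarks: arguana, climate-fever, dbpedia-entity, fever, fiqa, hotpotqa, msmarco, nfcorpus, nq, quora, scidocs, scifact, and webis-touche2020. As a quick orientation, the sketch below loads each one and prints its query count; the list of IDs is taken from this page, and queries_count() is part of the standard ir_datasets API.
import ir_datasets

# The thirteen Nano BEIR benchmarks documented on this page.
BENCHMARKS = [
    "arguana", "climate-fever", "dbpedia-entity", "fever", "fiqa",
    "hotpotqa", "msmarco", "nfcorpus", "nq", "quora", "scidocs",
    "scifact", "webis-touche2020",
]

for name in BENCHMARKS:
    dataset = ir_datasets.load(f"nano-beir/{name}")
    # Each benchmark has at most 50 queries (webis-touche2020 has 49).
    print(name, dataset.queries_count())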
Bibtex:
@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}
A version of the ArguAna Counterargs dataset, for argument retrieval.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/arguana")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/arguana queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/arguana')
index_ref = pt.IndexRef.of('./indices/nano-beir_arguana') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.nano-beir.arguana.queries') # AdhocTopics
for topic in topics.iter():
print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/arguana")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/arguana docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/arguana')
# Index nano-beir/arguana
indexer = pt.IterDictIndexer('./indices/nano-beir_arguana', meta={"docno": 47})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.nano-beir.arguana')
for doc in dataset.iter_documents():
print(doc) # a document from the AdhocDocumentStore
break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | relevant | 50 | 100.0% |
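The counts in this table can be recomputed directly from the qrels; a minimal sketch using collections.Counter over the standard qrels_iter() API:
from collections import Counter
import ir_datasets

dataset = ir_datasets.load("nano-beir/arguana")
# Tally qrels by relevance level; this should match the table above.
counts = Counter(qrel.relevance for qrel in dataset.qrels_iter())
print(counts)  # expected: Counter({1: 50})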
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/arguana")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export nano-beir/arguana qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nano-beir/arguana')
index_ref = pt.IndexRef.of('./indices/nano-beir_arguana') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.nano-beir.arguana.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
print(topic_qrels) # qrels for a single topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
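The same qrels can also be evaluated without PyTerrier by feeding them straight into ir_measures (the evaluation library also used by pt.Experiment). A minimal sketch; the run dict here is a placeholder you would fill from your own retriever:
import ir_datasets
import ir_measures
from ir_measures import AP, nDCG

dataset = ir_datasets.load("nano-beir/arguana")
run = {}  # placeholder: {query_id: {doc_id: score}} produced by your retriever
# ir_measures accepts ir_datasets qrels iterators directly; AP is MAP.
print(ir_measures.calc_aggregate([AP, nDCG@20], dataset.qrels_iter(), run))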
Bibtex:
@inproceedings{Wachsmuth2018Arguana,
  author = "Wachsmuth, Henning and Syed, Shahbaz and Stein, Benno",
  title = "Retrieval of the Best Counterargument without Prior Topic Knowledge",
  booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  year = "2018",
  publisher = "Association for Computational Linguistics",
  location = "Melbourne, Australia",
  pages = "241--251",
  url = "http://aclweb.org/anthology/P18-1023"
}
@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}
{
"docs": {
"count": 3635,
"fields": {
"doc_id": {
"max_len": 47,
"common_prefix": ""
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 50,
"fields": {
"relevance": {
"counts_by_value": {
"1": 50
}
}
}
}
}
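The statistics block above can be reproduced in a single pass over the corpus; a minimal sketch for the docs fields (count, maximum doc_id length, and common prefix):
import ir_datasets
import os.path

dataset = ir_datasets.load("nano-beir/arguana")
doc_count = 0
max_len = 0
prefix = None
for doc in dataset.docs_iter():
    doc_count += 1
    max_len = max(max_len, len(doc.doc_id))
    # Longest prefix shared by all doc_ids seen so far.
    prefix = doc.doc_id if prefix is None else os.path.commonprefix([prefix, doc.doc_id])
print(doc_count, max_len, repr(prefix))  # expected: 3635 47 ''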
A version of the CLIMATE-FEVER dataset, for fact verification on claims about climate.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/climate-fever")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/climate-fever queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/climate-fever')
index_ref = pt.IndexRef.of('./indices/nano-beir_climate-fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.nano-beir.climate-fever.queries') # AdhocTopics
for topic in topics.iter():
print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/climate-fever")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/climate-fever docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/climate-fever')
# Index nano-beir/climate-fever
indexer = pt.IterDictIndexer('./indices/nano-beir_climate-fever', meta={"docno": 130})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.nano-beir.climate-fever')
for doc in dataset.iter_documents():
print(doc) # a document from the AdhocDocumentStore
break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | relevant | 148 | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/climate-fever")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export nano-beir/climate-fever qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nano-beir/climate-fever')
index_ref = pt.IndexRef.of('./indices/nano-beir_climate-fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.nano-beir.climate-fever.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
print(topic_qrels) # qrels for a single topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@article{Diggelmann2020CLIMATEFEVERAD,
  title = {CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims},
  author = {T. Diggelmann and Jordan L. Boyd-Graber and Jannis Bulian and Massimiliano Ciaramita and Markus Leippold},
  journal = {ArXiv},
  year = {2020},
  volume = {abs/2012.00614}
}
@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}
{
"docs": {
"count": 3408,
"fields": {
"doc_id": {
"max_len": 130,
"common_prefix": ""
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 148,
"fields": {
"relevance": {
"counts_by_value": {
"1": 148
}
}
}
}
}
A version of the DBpedia-Entity v2 dataset, for entity retrieval.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/dbpedia-entity")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/dbpedia-entity queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/dbpedia-entity')
index_ref = pt.IndexRef.of('./indices/nano-beir_dbpedia-entity') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.nano-beir.dbpedia-entity.queries') # AdhocTopics
for topic in topics.iter():
print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/dbpedia-entity")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/dbpedia-entity docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/dbpedia-entity')
# Index nano-beir/dbpedia-entity
indexer = pt.IterDictIndexer('./indices/nano-beir_dbpedia-entity', meta={"docno": 108})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.nano-beir.dbpedia-entity')
for doc in dataset.iter_documents():
print(doc) # a document from the AdhocDocumentStore
break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | relevant | 1.2K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/dbpedia-entity")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export nano-beir/dbpedia-entity qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nano-beir/dbpedia-entity')
index_ref = pt.IndexRef.of('./indices/nano-beir_dbpedia-entity') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.nano-beir.dbpedia-entity.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
print(topic_qrels) # qrels for a single topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@article{Hasibi2017DBpediaEntityVA,
  title = {DBpedia-Entity v2: A Test Collection for Entity Search},
  author = {Faegheh Hasibi and Fedor Nikolaev and Chenyan Xiong and K. Balog and S. E. Bratsberg and Alexander Kotov and J. Callan},
  journal = {Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year = {2017}
}
@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}
{
"docs": {
"count": 6045,
"fields": {
"doc_id": {
"max_len": 108,
"common_prefix": "
A version of the FEVER dataset for fact verification.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/fever")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/fever queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/fever')
index_ref = pt.IndexRef.of('./indices/nano-beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.nano-beir.fever.queries') # AdhocTopics
for topic in topics.iter():
print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/fever")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/fever docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/fever')
# Index nano-beir/fever
indexer = pt.IterDictIndexer('./indices/nano-beir_fever', meta={"docno": 88})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.nano-beir.fever')
for doc in dataset.iter_documents():
print(doc) # a document from the AdhocDocumentStore
break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | relevant | 57 | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/fever")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export nano-beir/fever qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nano-beir/fever')
index_ref = pt.IndexRef.of('./indices/nano-beir_fever') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.nano-beir.fever.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
print(topic_qrels) # qrels for a single topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Thorne2018Fever,
  title = "{FEVER}: a Large-scale Dataset for Fact Extraction and {VER}ification",
  author = "Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit",
  booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)",
  month = jun,
  year = "2018",
  address = "New Orleans, Louisiana",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/N18-1074",
  doi = "10.18653/v1/N18-1074",
  pages = "809--819"
}
@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}
{
"docs": {
"count": 4996,
"fields": {
"doc_id": {
"max_len": 88,
"common_prefix": ""
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 57,
"fields": {
"relevance": {
"counts_by_value": {
"1": 57
}
}
}
}
}
A version of the FIQA-2018 dataset (financial opinion question answering).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/fiqa")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/fiqa queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/fiqa')
index_ref = pt.IndexRef.of('./indices/nano-beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.nano-beir.fiqa.queries') # AdhocTopics
for topic in topics.iter():
print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/fiqa")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/fiqa docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/fiqa')
# Index nano-beir/fiqa
indexer = pt.IterDictIndexer('./indices/nano-beir_fiqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.nano-beir.fiqa')
for doc in dataset.iter_documents():
print(doc) # a document from the AdhocDocumentStore
break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | relevant | 123 | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/fiqa")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export nano-beir/fiqa qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nano-beir/fiqa')
index_ref = pt.IndexRef.of('./indices/nano-beir_fiqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.nano-beir.fiqa.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
print(topic_qrels) # qrels for a single topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@article{Maia2018Fiqa,
  title = {WWW'18 Open Challenge: Financial Opinion Mining and Question Answering},
  author = {Macedo Maia and S. Handschuh and A. Freitas and Brian Davis and R. McDermott and M. Zarrouk and A. Balahur},
  journal = {Companion Proceedings of the The Web Conference 2018},
  year = {2018}
}
@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}
{
"docs": {
"count": 4598,
"fields": {
"doc_id": {
"max_len": 6,
"common_prefix": ""
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 123,
"fields": {
"relevance": {
"counts_by_value": {
"1": 123
}
}
}
}
}
A version of the HotpotQA dataset, for multi-hop question answering.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/hotpotqa")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/hotpotqa queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/hotpotqa')
index_ref = pt.IndexRef.of('./indices/nano-beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.nano-beir.hotpotqa.queries') # AdhocTopics
for topic in topics.iter():
print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/hotpotqa")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/hotpotqa docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/hotpotqa')
# Index nano-beir/hotpotqa
indexer = pt.IterDictIndexer('./indices/nano-beir_hotpotqa')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.nano-beir.hotpotqa')
for doc in dataset.iter_documents():
print(doc) # a document from the AdhocDocumentStore
break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | relevant | 100 | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/hotpotqa")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export nano-beir/hotpotqa qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nano-beir/hotpotqa')
index_ref = pt.IndexRef.of('./indices/nano-beir_hotpotqa') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.nano-beir.hotpotqa.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
print(topic_qrels) # qrels for a single topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Yang2018Hotpotqa,
  title = "{H}otpot{QA}: A Dataset for Diverse, Explainable Multi-hop Question Answering",
  author = "Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William and Salakhutdinov, Ruslan and Manning, Christopher D.",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
  month = oct # "-" # nov,
  year = "2018",
  address = "Brussels, Belgium",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/D18-1259",
  doi = "10.18653/v1/D18-1259",
  pages = "2369--2380"
}
@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}
{
"docs": {
"count": 5090,
"fields": {
"doc_id": {
"max_len": 8,
"common_prefix": ""
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 100,
"fields": {
"relevance": {
"counts_by_value": {
"1": 100
}
}
}
}
}
A version of the MS MARCO passage ranking dataset.
Note that this version differs from msmarco-passage in that it does not correct the encoding problems in the source documents.
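To see what the uncorrected encoding looks like, you can compare a passage here against the corrected text in msmarco-passage. A minimal sketch, assuming the doc_ids align with msmarco-passage IDs as they do in the full BEIR version; note that loading msmarco-passage downloads the full collection:
import ir_datasets

nano = ir_datasets.load("nano-beir/msmarco")
full_store = ir_datasets.load("msmarco-passage").docs_store()

# Print the first passage whose text differs from the corrected version.
for doc in nano.docs_iter():
    fixed = full_store.get_many([doc.doc_id]).get(doc.doc_id)
    if fixed is not None and fixed.text != doc.text:
        print(doc.doc_id)
        print("nano-beir:      ", doc.text[:100])
        print("msmarco-passage:", fixed.text[:100])
        break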
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/msmarco")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/msmarco queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/msmarco')
index_ref = pt.IndexRef.of('./indices/nano-beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.nano-beir.msmarco.queries') # AdhocTopics
for topic in topics.iter():
print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/msmarco")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/msmarco docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/msmarco')
# Index nano-beir/msmarco
indexer = pt.IterDictIndexer('./indices/nano-beir_msmarco')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.nano-beir.msmarco')
for doc in dataset.iter_documents():
print(doc) # a document from the AdhocDocumentStore
break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | relevant | 50 | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/msmarco")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export nano-beir/msmarco qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nano-beir/msmarco')
index_ref = pt.IndexRef.of('./indices/nano-beir_msmarco') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.nano-beir.msmarco.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
print(topic_qrels) # qrels for a single topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Bajaj2016Msmarco,
  title = {MS MARCO: A Human Generated MAchine Reading COmprehension Dataset},
  author = {Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, Tong Wang},
  booktitle = {InCoCo@NIPS},
  year = {2016}
}
@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}
{
"docs": {
"count": 5043,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 50,
"fields": {
"relevance": {
"counts_by_value": {
"1": 50
}
}
}
}
}
A version of the NF Corpus (Nutrition Facts).
Data pre-processing may differ from what is done in nfcorpus.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/nfcorpus")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/nfcorpus queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/nfcorpus')
index_ref = pt.IndexRef.of('./indices/nano-beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.nano-beir.nfcorpus.queries') # AdhocTopics
for topic in topics.iter():
print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/nfcorpus")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/nfcorpus docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/nfcorpus')
# Index nano-beir/nfcorpus
indexer = pt.IterDictIndexer('./indices/nano-beir_nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.nano-beir.nfcorpus')
for doc in dataset.iter_documents():
print(doc) # a document from the AdhocDocumentStore
break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | relevant | 2.5K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/nfcorpus")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export nano-beir/nfcorpus qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nano-beir/nfcorpus')
index_ref = pt.IndexRef.of('./indices/nano-beir_nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.nano-beir.nfcorpus.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
print(topic_qrels) # qrels for a single topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Boteva2016Nfcorpus,
  title = "A Full-Text Learning to Rank Dataset for Medical Information Retrieval",
  author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler",
  booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})",
  location = "Padova, Italy",
  publisher = "Springer",
  year = 2016
}
@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}
{
"docs": {
"count": 2953,
"fields": {
"doc_id": {
"max_len": 8,
"common_prefix": "MED-"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 2518,
"fields": {
"relevance": {
"counts_by_value": {
"1": 2518
}
}
}
}
}
A version of the Natural Questions dev dataset.
Data pre-processing differs from both natural-questions and dpr-w100/natural-questions, especially with respect to the document collection and the filtering applied to the queries. See the BEIR paper for details.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/nq")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/nq queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/nq')
index_ref = pt.IndexRef.of('./indices/nano-beir_nq') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.nano-beir.nq.queries') # AdhocTopics
for topic in topics.iter():
print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/nq")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/nq docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/nq')
# Index nano-beir/nq
indexer = pt.IterDictIndexer('./indices/nano-beir_nq')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.nano-beir.nq')
for doc in dataset.iter_documents():
print(doc) # a document from the AdhocDocumentStore
break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | relevant | 57 | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/nq")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export nano-beir/nq qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nano-beir/nq')
index_ref = pt.IndexRef.of('./indices/nano-beir_nq') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.nano-beir.nq.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
print(topic_qrels) # qrels for a single topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@article{Kwiatkowski2019Nq,
  title = {Natural Questions: a Benchmark for Question Answering Research},
  author = {Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Matthew Kelcey and Jacob Devlin and Kenton Lee and Kristina N. Toutanova and Llion Jones and Ming-Wei Chang and Andrew Dai and Jakob Uszkoreit and Quoc Le and Slav Petrov},
  year = {2019},
  journal = {TACL}
}
@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}
{
"docs": {
"count": 5035,
"fields": {
"doc_id": {
"max_len": 10,
"common_prefix": "doc"
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 57,
"fields": {
"relevance": {
"counts_by_value": {
"1": 57
}
}
}
}
}
A version of the Quora duplicate question detection dataset (QQP).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/quora")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/quora queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/quora')
index_ref = pt.IndexRef.of('./indices/nano-beir_quora') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.nano-beir.quora.queries') # AdhocTopics
for topic in topics.iter():
print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/quora")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/quora docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/quora')
# Index nano-beir/quora
indexer = pt.IterDictIndexer('./indices/nano-beir_quora')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.nano-beir.quora')
for doc in dataset.iter_documents():
print(doc) # a document from the AdhocDocumentStore
break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | relevant | 70 | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/quora")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export nano-beir/quora qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nano-beir/quora')
index_ref = pt.IndexRef.of('./indices/nano-beir_quora') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.nano-beir.quora.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
print(topic_qrels) # qrels for a single topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}
{
"docs": {
"count": 5046,
"fields": {
"doc_id": {
"max_len": 6,
"common_prefix": ""
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 70,
"fields": {
"relevance": {
"counts_by_value": {
"1": 70
}
}
}
}
}
A version of the SciDocs dataset, used for citation retrieval.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/scidocs")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/scidocs queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/scidocs')
index_ref = pt.IndexRef.of('./indices/nano-beir_scidocs') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.nano-beir.scidocs.queries') # AdhocTopics
for topic in topics.iter():
print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/scidocs")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/scidocs docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/scidocs')
# Index nano-beir/scidocs
indexer = pt.IterDictIndexer('./indices/nano-beir_scidocs', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.nano-beir.scidocs')
for doc in dataset.iter_documents():
print(doc) # a document from the AdhocDocumentStore
break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | relevant | 244 | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/scidocs")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export nano-beir/scidocs qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nano-beir/scidocs')
index_ref = pt.IndexRef.of('./indices/nano-beir_scidocs') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.nano-beir.scidocs.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
print(topic_qrels) # qrels for a single topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Cohan2020Scidocs,
  title = "{SPECTER}: Document-level Representation Learning using Citation-informed Transformers",
  author = "Cohan, Arman and Feldman, Sergey and Beltagy, Iz and Downey, Doug and Weld, Daniel",
  booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
  month = jul,
  year = "2020",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/2020.acl-main.207",
  doi = "10.18653/v1/2020.acl-main.207",
  pages = "2270--2282"
}
@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}
{
"docs": {
"count": 2210,
"fields": {
"doc_id": {
"max_len": 40,
"common_prefix": ""
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 244,
"fields": {
"relevance": {
"counts_by_value": {
"1": 244
}
}
}
}
}
A version of the SciFact dataset, for fact verification.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/scifact")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/scifact queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/scifact')
index_ref = pt.IndexRef.of('./indices/nano-beir_scifact') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.nano-beir.scifact.queries') # AdhocTopics
for topic in topics.iter():
print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/scifact")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/scifact docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/scifact')
# Index nano-beir/scifact
indexer = pt.IterDictIndexer('./indices/nano-beir_scifact')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.nano-beir.scifact')
for doc in dataset.iter_documents():
print(doc) # a document from the AdhocDocumentStore
break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | relevant | 56 | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/scifact")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export nano-beir/scifact qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nano-beir/scifact')
index_ref = pt.IndexRef.of('./indices/nano-beir_scifact') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.nano-beir.scifact.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
print(topic_qrels) # qrels for a single topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Wadden2020Scifact,
  title = "Fact or Fiction: Verifying Scientific Claims",
  author = "Wadden, David and Lin, Shanchuan and Lo, Kyle and Wang, Lucy Lu and van Zuylen, Madeleine and Cohan, Arman and Hajishirzi, Hannaneh",
  booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
  month = nov,
  year = "2020",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/2020.emnlp-main.609",
  doi = "10.18653/v1/2020.emnlp-main.609",
  pages = "7534--7550"
}
@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}
{
"docs": {
"count": 2919,
"fields": {
"doc_id": {
"max_len": 9,
"common_prefix": ""
}
}
},
"queries": {
"count": 50
},
"qrels": {
"count": 56,
"fields": {
"relevance": {
"counts_by_value": {
"1": 56
}
}
}
}
}
The original version of the Touché-2020 dataset, for argument retrieval.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/webis-touche2020")
for query in dataset.queries_iter():
query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/webis-touche2020 queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/webis-touche2020')
index_ref = pt.IndexRef.of('./indices/nano-beir_webis-touche2020') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.nano-beir.webis-touche2020.queries') # AdhocTopics
for topic in topics.iter():
print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/webis-touche2020")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export nano-beir/webis-touche2020 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nano-beir/webis-touche2020')
# Index nano-beir/webis-touche2020
indexer = pt.IterDictIndexer('./indices/nano-beir_webis-touche2020', meta={"docno": 39})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.nano-beir.webis-touche2020')
for doc in dataset.iter_documents():
print(doc) # a document from the AdhocDocumentStore
break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | relevant | 932 | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("nano-beir/webis-touche2020")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export nano-beir/webis-touche2020 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nano-beir/webis-touche2020')
index_ref = pt.IndexRef.of('./indices/nano-beir_webis-touche2020') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.nano-beir.webis-touche2020.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
print(topic_qrels) # qrels for a single topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Bondarenko2020Tuche,
  title = {Overview of Touch{\'e} 2020: Argument Retrieval},
  author = {Alexander Bondarenko and Maik Fr{\"o}be and Meriem Beloucif and Lukas Gienapp and Yamen Ajjour and Alexander Panchenko and Christian Biemann and Benno Stein and Henning Wachsmuth and Martin Potthast and Matthias Hagen},
  booktitle = {CLEF},
  year = {2020}
}
@article{Thakur2021Beir,
  title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
  author = "Thakur, Nandan and Reimers, Nils and Rücklé, Andreas and Srivastava, Abhishek and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.08663",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.08663",
}
{
"docs": {
"count": 5745,
"fields": {
"doc_id": {
"max_len": 39,
"common_prefix": ""
}
}
},
"queries": {
"count": 49
},
"qrels": {
"count": 932,
"fields": {
"relevance": {
"counts_by_value": {
"1": 932
}
}
}
}
}