ir_datasets: ANTIQUE
"ANTIQUE is a non-factoid question answering dataset based on the questions and answers of Yahoo! Webscope L6."
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("antique")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:antique')
# Index antique
indexer = pt.IterDictIndexer('./indices/antique')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.antique')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Bibtex:
@inproceedings{Hashemi2020Antique,
  title={ANTIQUE: A Non-Factoid Question Answering Benchmark},
  author={Helia Hashemi and Mohammad Aliannejadi and Hamed Zamani and Bruce Croft},
  booktitle={ECIR},
  year={2020}
}
Metadata:
{ "docs": { "count": 403666, "fields": { "doc_id": { "max_len": 10, "common_prefix": "" } } } }
Official test set of the ANTIQUE dataset.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("antique/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export antique/test queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:antique/test')
index_ref = pt.IndexRef.of('./indices/antique') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.antique.test.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from antique
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("antique/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export antique/test docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:antique/test')
# Index antique
indexer = pt.IterDictIndexer('./indices/antique')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.antique.test')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | It is completely out of context or does not make any sense. | 1.6K | 24.9% |
2 | It does not answer the question or if it does, it provides an unreasonable answer, however, it is not out of context. Therefore, you cannot accept it as an answer to the question. | 2.4K | 36.7% |
3 | It can be an answer to the question, however, it is not sufficiently convincing. There should be an answer with much better quality for the question. | 1.2K | 18.2% |
4 | It looks reasonable and convincing. Its quality is on par with or better than the "Possibly Correct Answer". Note that it does not have to provide the same answer as the "Possibly Correct Answer". | 1.3K | 20.2% |
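The scale above is graded (1 to 4), so binary measures need a cut-off between "relevant" and "not relevant". The following is a minimal sketch of such a binarization. The `Qrel` tuple is a hypothetical stand-in for the records yielded by `qrels_iter()`, and the cut-off of 3 is an assumption made for illustration, not something this page specifies; the ANTIQUE paper is the place to check for the authors' recommended mapping.

```python
from collections import namedtuple

# Hypothetical stand-in for the records yielded by dataset.qrels_iter()
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance"])


def binarize(qrels, min_rel=3):
    """Map graded labels to 0/1. min_rel=3 is an assumed cut-off."""
    return [q._replace(relevance=int(q.relevance >= min_rel)) for q in qrels]


# Made-up judgments, one per relevance level
sample = [
    Qrel("q1", "d1", 1),
    Qrel("q1", "d2", 2),
    Qrel("q1", "d3", 3),
    Qrel("q1", "d4", 4),
]
binary = binarize(sample)
print([q.relevance for q in binary])
```

Raising or lowering `min_rel` changes which judgments count as relevant, so the chosen value should be reported alongside any binary measure.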
Examples:
import ir_datasets
dataset = ir_datasets.load("antique/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export antique/test qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:antique/test')
index_ref = pt.IndexRef.of('./indices/antique') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
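To make the nDCG@20 measure used above more concrete, here is a small pure-Python sketch of one common nDCG formulation (linear gain, log2 rank discount). It is not the code PyTerrier runs. The helper names are hypothetical, and shifting the ANTIQUE labels 1..4 down to gains 0..3 is an assumption made for illustration.

```python
import math


def to_gain(label):
    # Assumption: shift ANTIQUE labels 1..4 down to gains 0..3
    return max(label - 1, 0)


def dcg_at_k(gains, k):
    # Discounted cumulative gain over the top-k positions (rank is 0-based)
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_labels, judged_labels, k=20):
    """ranked_labels: labels in system order; judged_labels: all labels for the query."""
    ranked = [to_gain(label) for label in ranked_labels]
    ideal = sorted((to_gain(label) for label in judged_labels), reverse=True)
    best = dcg_at_k(ideal, k)
    return dcg_at_k(ranked, k) / best if best > 0 else 0.0


# Made-up ranking: the label-3 answer is ranked below a label-1 answer
score = ndcg_at_k([4, 1, 3], [4, 3, 1])
perfect = ndcg_at_k([4, 3, 1], [4, 3, 1])
print(round(score, 3), perfect)
```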
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.antique.test.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Hashemi2020Antique,
  title={ANTIQUE: A Non-Factoid Question Answering Benchmark},
  author={Helia Hashemi and Mohammad Aliannejadi and Hamed Zamani and Bruce Croft},
  booktitle={ECIR},
  year={2020}
}
Metadata:
{ "docs": { "count": 403666, "fields": { "doc_id": { "max_len": 10, "common_prefix": "" } } }, "queries": { "count": 200 }, "qrels": { "count": 6589, "fields": { "relevance": { "counts_by_value": { "4": 1334, "1": 1642, "2": 2417, "3": 1196 } } } } }
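The percentages in the relevance table for antique/test can be reproduced from the `counts_by_value` entry in the metadata above. A minimal sketch, with the counts copied by hand from that metadata rather than loaded through any library call:

```python
# Counts copied from the antique/test metadata above
counts_by_value = {"4": 1334, "1": 1642, "2": 2417, "3": 1196}

total = sum(counts_by_value.values())
shares = {
    rel: round(100 * count / total, 1)
    for rel, count in sorted(counts_by_value.items())
}

print(total)
for rel, pct in shares.items():
    print(f"relevance {rel}: {pct}%")
```

The total should match the `qrels` count reported in the same metadata.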
antique/test without a set of queries deemed by the authors of ANTIQUE to be "offensive (and noisy)."
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("antique/test/non-offensive")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export antique/test/non-offensive queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:antique/test/non-offensive')
index_ref = pt.IndexRef.of('./indices/antique') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.antique.test.non-offensive.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from antique
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("antique/test/non-offensive")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export antique/test/non-offensive docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:antique/test/non-offensive')
# Index antique
indexer = pt.IterDictIndexer('./indices/antique')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.antique.test.non-offensive')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | It is completely out of context or does not make any sense. | 1.4K | 24.5% |
2 | It does not answer the question or if it does, it provides an unreasonable answer, however, it is not out of context. Therefore, you cannot accept it as an answer to the question. | 2.1K | 36.5% |
3 | It can be an answer to the question, however, it is not sufficiently convincing. There should be an answer with much better quality for the question. | 1.0K | 18.2% |
4 | It looks reasonable and convincing. Its quality is on par with or better than the "Possibly Correct Answer". Note that it does not have to provide the same answer as the "Possibly Correct Answer". | 1.2K | 20.8% |
Examples:
import ir_datasets
dataset = ir_datasets.load("antique/test/non-offensive")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export antique/test/non-offensive qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:antique/test/non-offensive')
index_ref = pt.IndexRef.of('./indices/antique') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.antique.test.non-offensive.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Hashemi2020Antique,
  title={ANTIQUE: A Non-Factoid Question Answering Benchmark},
  author={Helia Hashemi and Mohammad Aliannejadi and Hamed Zamani and Bruce Croft},
  booktitle={ECIR},
  year={2020}
}
Metadata:
{ "docs": { "count": 403666, "fields": { "doc_id": { "max_len": 10, "common_prefix": "" } } }, "queries": { "count": 176 }, "qrels": { "count": 5752, "fields": { "relevance": { "counts_by_value": { "4": 1195, "1": 1407, "2": 2101, "3": 1049 } } } } }
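According to the metadata, antique/test/non-offensive keeps 176 of the 200 test queries (24 fewer) and 5752 of the 6589 judgments (837 fewer). The sketch below illustrates the general idea of removing a set of queries together with their judgments. It is not how ir_datasets builds the subset: the function name, the record types, and the excluded IDs are all hypothetical.

```python
from collections import namedtuple

# Hypothetical stand-ins for the records yielded by queries_iter() / qrels_iter()
Query = namedtuple("Query", ["query_id", "text"])
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance"])


def drop_queries(queries, qrels, excluded_ids):
    """Return the queries and qrels whose query_id is not in excluded_ids."""
    excluded = set(excluded_ids)
    kept_queries = [q for q in queries if q.query_id not in excluded]
    kept_qrels = [r for r in qrels if r.query_id not in excluded]
    return kept_queries, kept_qrels


# Made-up data: q2 plays the role of an excluded query
queries = [Query("q1", "first question"), Query("q2", "second question")]
qrels = [Qrel("q1", "d1", 4), Qrel("q2", "d2", 1), Qrel("q2", "d3", 3)]

kept_queries, kept_qrels = drop_queries(queries, qrels, {"q2"})
print(len(kept_queries), len(kept_qrels))
```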
Official train set of the ANTIQUE dataset.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("antique/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export antique/train queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:antique/train')
index_ref = pt.IndexRef.of('./indices/antique') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.antique.train.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from antique
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("antique/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export antique/train docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:antique/train')
# Index antique
indexer = pt.IterDictIndexer('./indices/antique')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.antique.train')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | It is completely out of context or does not make any sense. | 1.3K | 4.6% |
2 | It does not answer the question or if it does, it provides an unreasonable answer, however, it is not out of context. Therefore, you cannot accept it as an answer to the question. | 6.3K | 23.1% |
3 | It can be an answer to the question, however, it is not sufficiently convincing. There should be an answer with much better quality for the question. | 8.1K | 29.5% |
4 | It looks reasonable and convincing. Its quality is on par with or better than the "Possibly Correct Answer". Note that it does not have to provide the same answer as the "Possibly Correct Answer". | 12K | 42.8% |
Examples:
import ir_datasets
dataset = ir_datasets.load("antique/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export antique/train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:antique/train')
index_ref = pt.IndexRef.of('./indices/antique') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.antique.train.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Hashemi2020Antique,
  title={ANTIQUE: A Non-Factoid Question Answering Benchmark},
  author={Helia Hashemi and Mohammad Aliannejadi and Hamed Zamani and Bruce Croft},
  booktitle={ECIR},
  year={2020}
}
Metadata:
{ "docs": { "count": 403666, "fields": { "doc_id": { "max_len": 10, "common_prefix": "" } } }, "queries": { "count": 2426 }, "qrels": { "count": 27422, "fields": { "relevance": { "counts_by_value": { "4": 11733, "3": 8080, "2": 6337, "1": 1272 } } } } }
antique/train without the 200 queries used by antique/train/split200-valid.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("antique/train/split200-train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export antique/train/split200-train queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:antique/train/split200-train')
index_ref = pt.IndexRef.of('./indices/antique') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.antique.train.split200-train.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from antique
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("antique/train/split200-train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export antique/train/split200-train docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:antique/train/split200-train')
# Index antique
indexer = pt.IterDictIndexer('./indices/antique')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.antique.train.split200-train')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | It is completely out of context or does not make any sense. | 1.2K | 4.6% |
2 | It does not answer the question or if it does, it provides an unreasonable answer, however, it is not out of context. Therefore, you cannot accept it as an answer to the question. | 5.8K | 23.1% |
3 | It can be an answer to the question, however, it is not sufficiently convincing. There should be an answer with much better quality for the question. | 7.4K | 29.5% |
4 | It looks reasonable and convincing. Its quality is on par with or better than the "Possibly Correct Answer". Note that it does not have to provide the same answer as the "Possibly Correct Answer". | 11K | 42.7% |
Examples:
import ir_datasets
dataset = ir_datasets.load("antique/train/split200-train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export antique/train/split200-train qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:antique/train/split200-train')
index_ref = pt.IndexRef.of('./indices/antique') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.antique.train.split200-train.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Hashemi2020Antique,
  title={ANTIQUE: A Non-Factoid Question Answering Benchmark},
  author={Helia Hashemi and Mohammad Aliannejadi and Hamed Zamani and Bruce Croft},
  booktitle={ECIR},
  year={2020}
}
Metadata:
{ "docs": { "count": 403666, "fields": { "doc_id": { "max_len": 10, "common_prefix": "" } } }, "queries": { "count": 2226 }, "qrels": { "count": 25229, "fields": { "relevance": { "counts_by_value": { "4": 10782, "3": 7447, "2": 5829, "1": 1171 } } } } }
A held-out subset of 200 queries from antique/train. Use in conjunction with antique/train/split200-train.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("antique/train/split200-valid")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export antique/train/split200-valid queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:antique/train/split200-valid')
index_ref = pt.IndexRef.of('./indices/antique') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.antique.train.split200-valid.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from antique
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("antique/train/split200-valid")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export antique/train/split200-valid docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:antique/train/split200-valid')
# Index antique
indexer = pt.IterDictIndexer('./indices/antique')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.antique.train.split200-valid')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | It is completely out of context or does not make any sense. | 101 | 4.6% |
2 | It does not answer the question or if it does, it provides an unreasonable answer, however, it is not out of context. Therefore, you cannot accept it as an answer to the question. | 508 | 23.2% |
3 | It can be an answer to the question, however, it is not sufficiently convincing. There should be an answer with much better quality for the question. | 633 | 28.9% |
4 | It looks reasonable and convincing. Its quality is on par with or better than the "Possibly Correct Answer". Note that it does not have to provide the same answer as the "Possibly Correct Answer". | 951 | 43.4% |
Examples:
import ir_datasets
dataset = ir_datasets.load("antique/train/split200-valid")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export antique/train/split200-valid qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:antique/train/split200-valid')
index_ref = pt.IndexRef.of('./indices/antique') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.antique.train.split200-valid.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Hashemi2020Antique,
  title={ANTIQUE: A Non-Factoid Question Answering Benchmark},
  author={Helia Hashemi and Mohammad Aliannejadi and Hamed Zamani and Bruce Croft},
  booktitle={ECIR},
  year={2020}
}
Metadata:
{ "docs": { "count": 403666, "fields": { "doc_id": { "max_len": 10, "common_prefix": "" } } }, "queries": { "count": 200 }, "qrels": { "count": 2193, "fields": { "relevance": { "counts_by_value": { "4": 951, "2": 508, "3": 633, "1": 101 } } } } }
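Because antique/train/split200-valid is a held-out subset meant to be used together with antique/train/split200-train, it can be worth confirming that the two query sets are disjoint and together cover antique/train (2226 + 200 = 2426 queries in the metadata). A minimal sketch of such a check follows. `check_split` is a hypothetical helper, not part of ir_datasets, and the IDs below are made up.

```python
def check_split(train_ids, valid_ids, full_ids):
    """True if train and valid share no IDs and together equal the full set."""
    train, valid, full = set(train_ids), set(valid_ids), set(full_ids)
    return train.isdisjoint(valid) and (train | valid) == full


# Made-up query IDs standing in for the three query sets.
# With ir_datasets installed, each argument could instead be built as
#   (q.query_id for q in ir_datasets.load("antique/train/split200-train").queries_iter())
full_ids = ["q1", "q2", "q3", "q4"]
train_ids = ["q1", "q2", "q3"]
valid_ids = ["q4"]

print(check_split(train_ids, valid_ids, full_ids))
```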