ir_datasets: KILT
KILT is a corpus used for various "knowledge intensive language tasks".
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>
You can find more details about the Python API here.
ir_datasets export kilt docs
[doc_id] [title] [text] [text_pieces] [anchors] [categories] [wikidata_id] [history_revid] [history_timestamp] [history_parentid] [history_pageid] [history_url]
...
You can find more details about the CLI here.
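The exported rows are tab-separated in the field order shown above. As a minimal sketch, one such row can be parsed back into a dict (the sample row here is synthetic, not a real KILT document, and this assumes field values contain no literal tabs):

```python
# Field order matches the `ir_datasets export kilt docs` output above.
FIELDS = ["doc_id", "title", "text", "text_pieces", "anchors", "categories",
          "wikidata_id", "history_revid", "history_timestamp",
          "history_parentid", "history_pageid", "history_url"]

def parse_doc_row(line: str) -> dict:
    """Split one exported TSV row into a field -> value dict."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(FIELDS, values))

# Synthetic example row (not real KILT data):
row = "290\tA\tA is the first letter...\t[]\t[]\t[]\tQ9659\t1\t2\t3\t4\thttp://example.org"
doc = parse_doc_row(row)
```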
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.kilt')
for doc in dataset.iter_documents():
print(doc) # a document from the AdhocDocumentStore
break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Bibtex:
@inproceedings{petroni-etal-2021-kilt,
  title = "{KILT}: a Benchmark for Knowledge Intensive Language Tasks",
  author = {Petroni, Fabio and Piktus, Aleksandra and Fan, Angela and Lewis, Patrick and Yazdani, Majid and De Cao, Nicola and Thorne, James and Jernite, Yacine and Karpukhin, Vladimir and Maillard, Jean and Plachouras, Vassilis and Rockt{\"a}schel, Tim and Riedel, Sebastian},
  booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
  month = "jun",
  year = "2021",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.naacl-main.200",
  doi = "10.18653/v1/2021.naacl-main.200",
  pages = "2523--2544",
}
Metadata: 5,903,530 docs (doc_id: max length 8, no common prefix)
kilt/codec
CODEC Entity Ranking sub-task.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec")
for query in dataset.queries_iter():
query # namedtuple<query_id, query, domain, guidelines>
You can find more details about the Python API here.
ir_datasets export kilt/codec queries
[query_id] [query] [domain] [guidelines]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.kilt.codec.queries') # AdhocTopics
for topic in topics.iter():
print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from kilt
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>
You can find more details about the Python API here.
ir_datasets export kilt/codec docs
[doc_id] [title] [text] [text_pieces] [anchors] [categories] [wikidata_id] [history_revid] [history_timestamp] [history_parentid] [history_pageid] [history_url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.kilt.codec')
for doc in dataset.iter_documents():
print(doc) # a document from the AdhocDocumentStore
break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Not Relevant. This entity is not useful or on topic. | 7.1K | 62.3% |
1 | Not Valuable. It is useful to understand what this entity is for understanding this topic. | 2.2K | 19.8% |
2 | Somewhat Valuable. It is important to understand what this entity is for understanding this topic. | 1.3K | 11.1% |
3 | Very Valuable. It is absolutely critical to understand what this entity is for understanding this topic. | 777 | 6.9% |
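The percentages in the table above can be recomputed from the raw judgment counts; a quick sanity check in plain Python:

```python
# Relevance-level counts from the table above (kilt/codec qrels).
counts = {0: 7053, 1: 2241, 2: 1252, 3: 777}
total = sum(counts.values())  # 11,323 judgments in total
# Percentage of judgments at each level, rounded to one decimal.
pcts = {rel: round(100 * n / total, 1) for rel, n in counts.items()}
```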
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export kilt/codec qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:kilt/codec')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
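The nDCG@20 measure used in the experiment above can be sketched in plain Python. This mirrors the common log2-discount formulation; PyTerrier's ir_measures backend may differ in details such as tie handling, so treat this as illustrative only:

```python
import math

def dcg(rels):
    """Discounted cumulative gain with the log2(rank + 1) discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(ranked_rels, all_rels, k=20):
    """nDCG@k: DCG of the top-k ranking over the ideal top-k DCG."""
    ideal = sorted(all_rels, reverse=True)
    idcg = dcg(ideal[:k])
    return dcg(ranked_rels[:k]) / idcg if idcg > 0 else 0.0

# Toy ranking with graded relevance (0-3, as in the qrels above):
score = ndcg_at_k([3, 0, 2], [3, 2, 0], k=20)
```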
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.kilt.codec.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{mackie2022codec,
  title = {CODEC: Complex Document and Entity Collection},
  author = {Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
  booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year = {2022}
}
Metadata: 5,903,530 docs; 42 queries; 11,323 qrels
kilt/codec/economics
Subset of codec that contains only the topics about economics.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec/economics")
for query in dataset.queries_iter():
query # namedtuple<query_id, query, domain, guidelines>
You can find more details about the Python API here.
ir_datasets export kilt/codec/economics queries
[query_id] [query] [domain] [guidelines]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/economics')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.kilt.codec.economics.queries') # AdhocTopics
for topic in topics.iter():
print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from kilt
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec/economics")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>
You can find more details about the Python API here.
ir_datasets export kilt/codec/economics docs
[doc_id] [title] [text] [text_pieces] [anchors] [categories] [wikidata_id] [history_revid] [history_timestamp] [history_parentid] [history_pageid] [history_url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/economics')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.kilt.codec.economics')
for doc in dataset.iter_documents():
print(doc) # a document from the AdhocDocumentStore
break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Not Relevant. Not useful or on topic. | 660 | 33.5% |
1 | Not Valuable. Consists of definitions or background. | 693 | 35.2% |
2 | Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge. | 458 | 23.2% |
3 | Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background. | 159 | 8.1% |
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec/economics")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export kilt/codec/economics qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
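The exported qrels rows can be loaded into the nested {query_id: {doc_id: relevance}} mapping that evaluation code commonly expects. A minimal sketch, using synthetic sample rows in the [query_id] [doc_id] [relevance] format shown above:

```python
def load_qrels(lines):
    """Build {query_id: {doc_id: relevance}} from exported TSV rows."""
    qrels = {}
    for line in lines:
        query_id, doc_id, relevance = line.rstrip("\n").split("\t")
        qrels.setdefault(query_id, {})[doc_id] = int(relevance)
    return qrels

# Synthetic rows (not real CODEC judgments):
rows = ["q1\td10\t3", "q1\td11\t0", "q2\td10\t1"]
qrels = load_qrels(rows)
```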
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/economics')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.kilt.codec.economics.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{mackie2022codec,
  title = {CODEC: Complex Document and Entity Collection},
  author = {Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
  booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year = {2022}
}
Metadata: 5,903,530 docs; 14 queries; 1,970 qrels
kilt/codec/history
Subset of codec that contains only the topics about history.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec/history")
for query in dataset.queries_iter():
query # namedtuple<query_id, query, domain, guidelines>
You can find more details about the Python API here.
ir_datasets export kilt/codec/history queries
[query_id] [query] [domain] [guidelines]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/history')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.kilt.codec.history.queries') # AdhocTopics
for topic in topics.iter():
print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from kilt
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec/history")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>
You can find more details about the Python API here.
ir_datasets export kilt/codec/history docs
[doc_id] [title] [text] [text_pieces] [anchors] [categories] [wikidata_id] [history_revid] [history_timestamp] [history_parentid] [history_pageid] [history_url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/history')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.kilt.codec.history')
for doc in dataset.iter_documents():
print(doc) # a document from the AdhocDocumentStore
break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Not Relevant. Not useful or on topic. | 998 | 49.3% |
1 | Not Valuable. Consists of definitions or background. | 618 | 30.5% |
2 | Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge. | 292 | 14.4% |
3 | Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background. | 116 | 5.7% |
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec/history")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export kilt/codec/history qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/history')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
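The MAP measure in the experiment above averages per-query average precision; note that it binarizes the graded judgments (the threshold depends on the evaluation tooling). A minimal sketch of average precision for one query, illustrative only:

```python
def average_precision(ranked_doc_ids, relevant):
    """AP: mean of precision@i over ranks i that hold a relevant doc."""
    hits, precisions = 0, []
    for i, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

# Toy ranking: the relevant docs land at ranks 1 and 3.
ap = average_precision(["d1", "d2", "d3"], {"d1", "d3"})
```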
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.kilt.codec.history.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{mackie2022codec,
  title = {CODEC: Complex Document and Entity Collection},
  author = {Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
  booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year = {2022}
}
Metadata: 5,903,530 docs; 14 queries; 2,024 qrels
kilt/codec/politics
Subset of codec that contains only the topics about politics.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec/politics")
for query in dataset.queries_iter():
query # namedtuple<query_id, query, domain, guidelines>
You can find more details about the Python API here.
ir_datasets export kilt/codec/politics queries
[query_id] [query] [domain] [guidelines]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/politics')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.kilt.codec.politics.queries') # AdhocTopics
for topic in topics.iter():
print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from kilt
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec/politics")
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>
You can find more details about the Python API here.
ir_datasets export kilt/codec/politics docs
[doc_id] [title] [text] [text_pieces] [anchors] [categories] [wikidata_id] [history_revid] [history_timestamp] [history_parentid] [history_pageid] [history_url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/politics')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.kilt.codec.politics')
for doc in dataset.iter_documents():
print(doc) # a document from the AdhocDocumentStore
break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Not Relevant. Not useful or on topic. | 695 | 31.7% |
1 | Not Valuable. Consists of definitions or background. | 899 | 41.0% |
2 | Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge. | 457 | 20.8% |
3 | Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background. | 141 | 6.4% |
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec/politics")
for qrel in dataset.qrels_iter():
qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export kilt/codec/politics qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/politics')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.kilt.codec.politics.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{mackie2022codec,
  title = {CODEC: Complex Document and Entity Collection},
  author = {Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
  booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year = {2022}
}
Metadata: 5,903,530 docs; 14 queries; 2,192 qrels