ir_datasets: KILT
KILT is a corpus used for various "knowledge intensive language tasks".
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>
You can find more details about the Python API here.
ir_datasets export kilt docs
[doc_id] [title] [text] [text_pieces] [anchors] [categories] [wikidata_id] [history_revid] [history_timestamp] [history_parentid] [history_pageid] [history_url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])
You can find more details about PyTerrier indexing here.
Bibtex:
@inproceedings{petroni-etal-2021-kilt,
  title = "{KILT}: a Benchmark for Knowledge Intensive Language Tasks",
  author = {Petroni, Fabio and Piktus, Aleksandra and Fan, Angela and Lewis, Patrick and Yazdani, Majid and De Cao, Nicola and Thorne, James and Jernite, Yacine and Karpukhin, Vladimir and Maillard, Jean and Plachouras, Vassilis and Rockt{\"a}schel, Tim and Riedel, Sebastian},
  booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
  month = "jun",
  year = "2021",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.naacl-main.200",
  doi = "10.18653/v1/2021.naacl-main.200",
  pages = "2523--2544",
}
Metadata:
{ "docs": { "count": 5903530, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } } }
CODEC Entity Ranking sub-task.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>
You can find more details about the Python API here.
ir_datasets export kilt/codec queries
[query_id] [query] [domain] [guidelines]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from kilt
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>
You can find more details about the Python API here.
ir_datasets export kilt/codec docs
[doc_id] [title] [text] [text_pieces] [anchors] [categories] [wikidata_id] [history_revid] [history_timestamp] [history_parentid] [history_pageid] [history_url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Not Relevant. This entity is not useful or on topic. | 7.1K | 62.3% |
1 | Not Valuable. It is useful to understand what this entity is for understanding this topic. | 2.2K | 19.8% |
2 | Somewhat Valuable. It is important to understand what this entity is for understanding this topic. | 1.3K | 11.1% |
3 | Very Valuable. It is absolutely critical to understand what this entity is for understanding this topic. | 777 | 6.9% |
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export kilt/codec qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
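The exported qrels can be loaded back into Python for inspection. The following is a minimal sketch, assuming the export is plain tab-separated text with the four columns in the order listed above; the sample rows, their IDs, and the helper name `read_qrels_tsv` are made up for illustration and are not part of ir_datasets.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample of exported rows (tab-separated); IDs and labels are made up.
SAMPLE_TSV = (
    "q-1\t12\t3\t0\n"
    "q-1\t307\t0\t0\n"
    "q-2\t12\t2\t0\n"
)


def read_qrels_tsv(handle):
    """Group rows of [query_id, doc_id, relevance, iteration] by query_id."""
    qrels = defaultdict(dict)
    for query_id, doc_id, relevance, _iteration in csv.reader(handle, delimiter="\t"):
        # Relevance is stored as an integer grade (0-3 in the table above).
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


qrels = read_qrels_tsv(io.StringIO(SAMPLE_TSV))
print(qrels)
```

In practice the handle would be a file opened on the redirected output of the export command rather than an in-memory string.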
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:kilt/codec')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
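To make clear how the graded relevance levels (0 to 3) feed into a measure such as nDCG@20, here is a small self-contained sketch of one common formulation (linear gain, log2 rank discount). It is an illustration only: the measure computed by PyTerrier may differ in details, and the function names and document IDs below are hypothetical.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (linear gain, log2 discount)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_doc_ids, judgments, k=20):
    """nDCG@k for one query; unjudged documents are treated as relevance 0."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(judgments.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Made-up judgments for a single query, using the 0-3 scale.
judgments = {"a": 3, "b": 2, "c": 0, "d": 1}
best = ndcg_at_k(["a", "b", "d", "c"], judgments)
worst = ndcg_at_k(["c", "d", "b", "a"], judgments)
print(best, worst)
```

A ranking that places higher grades first scores closer to 1, which is why the distinction between "Somewhat Valuable" and "Very Valuable" matters for this measure.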
Bibtex:
@inproceedings{mackie2022codec,
  title={CODEC: Complex Document and Entity Collection},
  author={Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
  booktitle={Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}
Metadata:
{ "docs": { "count": 5903530, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 42 }, "qrels": { "count": 11323, "fields": { "relevance": { "counts_by_value": { "0": 7053, "2": 1252, "3": 777, "1": 2241 } } } } }
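The percentages in the kilt/codec relevance table can be reproduced from the `counts_by_value` metadata above. The sketch below copies those counts by hand and recomputes each level's share; the variable names are illustrative only.

```python
# Counts copied from the "counts_by_value" metadata for kilt/codec qrels.
counts_by_value = {"0": 7053, "2": 1252, "3": 777, "1": 2241}

# Total number of judgments; the metadata reports this as the qrels count.
total = sum(counts_by_value.values())

# Share of each relevance level as a percentage, rounded to one decimal place.
shares = {
    int(level): round(100 * count / total, 1)
    for level, count in sorted(counts_by_value.items())
}

for level, share in shares.items():
    print(f"relevance {level}: {share}%")
```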
Subset of codec that only contains topics about economics.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec/economics")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>
You can find more details about the Python API here.
ir_datasets export kilt/codec/economics queries
[query_id] [query] [domain] [guidelines]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/economics')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from kilt
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec/economics")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>
You can find more details about the Python API here.
ir_datasets export kilt/codec/economics docs
[doc_id] [title] [text] [text_pieces] [anchors] [categories] [wikidata_id] [history_revid] [history_timestamp] [history_parentid] [history_pageid] [history_url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/economics')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Not Relevant. Not useful or on topic. | 660 | 33.5% |
1 | Not Valuable. Consists of definitions or background. | 693 | 35.2% |
2 | Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge. | 458 | 23.2% |
3 | Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background. | 159 | 8.1% |
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec/economics")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export kilt/codec/economics qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/economics')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
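MAP is defined over binary relevance, so the graded labels have to be collapsed to relevant or not relevant before it is computed. The sketch below shows how the choice of cut-off changes average precision for one query. The `min_rel` parameter, the function name, and the sample data are hypothetical, and this page does not state which cut-off the `MAP` measure applies, so treat the threshold as an assumption to check.

```python
def average_precision(ranked_doc_ids, judgments, min_rel=1):
    """Average precision for one query, counting grades >= min_rel as relevant."""
    relevant = {doc_id for doc_id, grade in judgments.items() if grade >= min_rel}
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant)


# Made-up judgments and ranking for a single query, using the 0-3 scale.
judgments = {"a": 3, "b": 1, "c": 0, "d": 2}
ranking = ["b", "c", "a", "d"]

ap_lenient = average_precision(ranking, judgments, min_rel=1)  # grades 1-3 count
ap_strict = average_precision(ranking, judgments, min_rel=2)   # only grades 2-3 count
print(ap_lenient, ap_strict)
```

Because level 1 is labelled "Not Valuable" in the table above, a stricter cut-off may be the more meaningful choice for some analyses.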
Bibtex:
@inproceedings{mackie2022codec,
  title={CODEC: Complex Document and Entity Collection},
  author={Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
  booktitle={Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}
Metadata:
{ "docs": { "count": 5903530, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 14 }, "qrels": { "count": 1970, "fields": { "relevance": { "counts_by_value": { "2": 458, "0": 660, "1": 693, "3": 159 } } } } }
Subset of codec that only contains topics about history.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec/history")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>
You can find more details about the Python API here.
ir_datasets export kilt/codec/history queries
[query_id] [query] [domain] [guidelines]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/history')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from kilt
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec/history")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>
You can find more details about the Python API here.
ir_datasets export kilt/codec/history docs
[doc_id] [title] [text] [text_pieces] [anchors] [categories] [wikidata_id] [history_revid] [history_timestamp] [history_parentid] [history_pageid] [history_url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/history')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Not Relevant. Not useful or on topic. | 998 | 49.3% |
1 | Not Valuable. Consists of definitions or background. | 618 | 30.5% |
2 | Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge. | 292 | 14.4% |
3 | Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background. | 116 | 5.7% |
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec/history")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export kilt/codec/history qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/history')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{mackie2022codec,
  title={CODEC: Complex Document and Entity Collection},
  author={Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
  booktitle={Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}
Metadata:
{ "docs": { "count": 5903530, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 14 }, "qrels": { "count": 2024, "fields": { "relevance": { "counts_by_value": { "0": 998, "1": 618, "2": 292, "3": 116 } } } } }
Subset of codec that only contains topics about politics.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec/politics")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>
You can find more details about the Python API here.
ir_datasets export kilt/codec/politics queries
[query_id] [query] [domain] [guidelines]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/politics')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from kilt
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec/politics")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>
You can find more details about the Python API here.
ir_datasets export kilt/codec/politics docs
[doc_id] [title] [text] [text_pieces] [anchors] [categories] [wikidata_id] [history_revid] [history_timestamp] [history_parentid] [history_pageid] [history_url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/politics')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Not Relevant. Not useful or on topic. | 695 | 31.7% |
1 | Not Valuable. Consists of definitions or background. | 899 | 41.0% |
2 | Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge. | 457 | 20.8% |
3 | Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background. | 141 | 6.4% |
Examples:
import ir_datasets
dataset = ir_datasets.load("kilt/codec/politics")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export kilt/codec/politics qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/politics')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{mackie2022codec,
  title={CODEC: Complex Document and Entity Collection},
  author={Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
  booktitle={Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}
Metadata:
{ "docs": { "count": 5903530, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 14 }, "qrels": { "count": 2192, "fields": { "relevance": { "counts_by_value": { "3": 141, "2": 457, "1": 899, "0": 695 } } } } }
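As a quick cross-check of the three topic subsets, the sketch below copies their query and qrels counts from the metadata on this page and derives the total number of queries and the average number of judgments per query. The dictionary layout and variable names are illustrative only.

```python
# Query counts and relevance-level counts copied from the subset metadata on this page.
subset_stats = {
    "kilt/codec/economics": {"queries": 14, "qrels": {0: 660, 1: 693, 2: 458, 3: 159}},
    "kilt/codec/history": {"queries": 14, "qrels": {0: 998, 1: 618, 2: 292, 3: 116}},
    "kilt/codec/politics": {"queries": 14, "qrels": {0: 695, 1: 899, 2: 457, 3: 141}},
}

# The three subsets together should account for every kilt/codec query.
total_queries = sum(stats["queries"] for stats in subset_stats.values())

# Number of judgments in each subset, and the average per query.
judged = {name: sum(stats["qrels"].values()) for name, stats in subset_stats.items()}
per_query = {
    name: judged[name] / stats["queries"] for name, stats in subset_stats.items()
}

print(total_queries)
for name in subset_stats:
    print(f"{name}: {judged[name]} judgments, {per_query[name]:.1f} per query")
```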