Github: datasets/kilt.py

ir_datasets: KILT

Index
  1. kilt
  2. kilt/codec
  3. kilt/codec/economics
  4. kilt/codec/history
  5. kilt/codec/politics

"kilt"

KILT is a Wikipedia-based corpus used for a variety of "knowledge intensive language tasks".

docs
5.9M docs

Language: en

Document type:
KiltDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. text_pieces: Tuple[str, ...]
  5. anchors: Tuple[
    KiltDocAnchor: (namedtuple)
    1. text: str
    2. href: str
    3. paragraph_id: int
    4. start: int
    5. end: int
    , ...]
  6. categories: Tuple[str, ...]
  7. wikidata_id: str
  8. history_revid: str
  9. history_timestamp: str
  10. history_parentid: str
  11. history_pageid: str
  12. history_url: str
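The anchors field captures the hyperlinks in each article. The sketch below shows how the pieces appear to fit together; the namedtuple mirrors and the sample document are invented for illustration, and the assumption that start and end are character offsets into the paragraph selected by paragraph_id should be checked against the actual corpus:

```python
from collections import namedtuple

# Illustrative mirrors of the namedtuples documented above.
KiltDocAnchor = namedtuple("KiltDocAnchor",
                           ["text", "href", "paragraph_id", "start", "end"])
KiltDoc = namedtuple("KiltDoc", [
    "doc_id", "title", "text", "text_pieces", "anchors", "categories",
    "wikidata_id", "history_revid", "history_timestamp",
    "history_parentid", "history_pageid", "history_url"])

# A made-up, minimal document with one paragraph and one anchor.
pieces = ("Edinburgh is the capital of Scotland.",)
doc = KiltDoc(
    doc_id="1", title="Edinburgh", text=" ".join(pieces), text_pieces=pieces,
    anchors=(KiltDocAnchor("Scotland", "Scotland", 0, 28, 36),),
    categories=("Capitals in Europe",), wikidata_id="Q23436",
    history_revid="", history_timestamp="", history_parentid="",
    history_pageid="", history_url="")

# Under the assumed semantics, slicing the paragraph by the anchor's
# offsets recovers the anchor's surface text.
for a in doc.anchors:
    para = doc.text_pieces[a.paragraph_id]
    assert para[a.start:a.end] == a.text
```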

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>

You can find more details about the Python API here.

CLI
ir_datasets export kilt docs
[doc_id]    [title]    [text]    [text_pieces]    [anchors]    [categories]    [wikidata_id]    [history_revid]    [history_timestamp]    [history_parentid]    [history_pageid]    [history_url]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])

You can find more details about PyTerrier indexing here.

Citation

ir_datasets.bib:

\cite{petroni-etal-2021-kilt}

Bibtex:

@inproceedings{petroni-etal-2021-kilt,
    title = "{KILT}: a Benchmark for Knowledge Intensive Language Tasks",
    author = {Petroni, Fabio and Piktus, Aleksandra and Fan, Angela and Lewis, Patrick and Yazdani, Majid and De Cao, Nicola and Thorne, James and Jernite, Yacine and Karpukhin, Vladimir and Maillard, Jean and Plachouras, Vassilis and Rockt{\"a}schel, Tim and Riedel, Sebastian},
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = "jun",
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.naacl-main.200",
    doi = "10.18653/v1/2021.naacl-main.200",
    pages = "2523--2544",
}

"kilt/codec"

CODEC Entity Ranking sub-task.

queries
42 queries

Language: en

Query type:
CodecQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. domain: str
  4. guidelines: str
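Queries exported in the tab-separated layout shown in the CLI section can be read back into matching namedtuples with only the standard library. The CodecQuery mirror and the sample row below are invented for illustration:

```python
import csv
import io
from collections import namedtuple

# Illustrative mirror of the query type documented above.
CodecQuery = namedtuple("CodecQuery", ["query_id", "query", "domain", "guidelines"])

# Made-up sample row in the layout produced by `ir_datasets export ... queries`.
tsv = "q1\tWhy is inflation bad?\teconomics\tRelevant docs discuss causes and effects of inflation.\n"

queries = [CodecQuery(*row) for row in csv.reader(io.StringIO(tsv), delimiter="\t")]

# Group query ids by their domain field.
by_domain = {}
for q in queries:
    by_domain.setdefault(q.domain, []).append(q.query_id)
```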

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>

You can find more details about the Python API here.

CLI
ir_datasets export kilt/codec queries
[query_id]    [query]    [domain]    [guidelines]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
5.9M docs

Inherits docs from kilt

Language: en

Document type:
KiltDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. text_pieces: Tuple[str, ...]
  5. anchors: Tuple[
    KiltDocAnchor: (namedtuple)
    1. text: str
    2. href: str
    3. paragraph_id: int
    4. start: int
    5. end: int
    , ...]
  6. categories: Tuple[str, ...]
  7. wikidata_id: str
  8. history_revid: str
  9. history_timestamp: str
  10. history_parentid: str
  11. history_pageid: str
  12. history_url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>

You can find more details about the Python API here.

CLI
ir_datasets export kilt/codec docs
[doc_id]    [title]    [text]    [text_pieces]    [anchors]    [categories]    [wikidata_id]    [history_revid]    [history_timestamp]    [history_parentid]    [history_pageid]    [history_url]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])

You can find more details about PyTerrier indexing here.

qrels
11K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str
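These qrels follow the standard TREC conventions, where the on-disk column order is query_id, iteration, doc_id, relevance. A minimal stdlib parser into the tuple above (the sample lines are invented):

```python
from collections import namedtuple

# Illustrative mirror of the qrel type documented above.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])

def parse_trec_qrels(lines):
    """Parse standard TREC qrels lines: query_id iteration doc_id relevance."""
    out = []
    for line in lines:
        qid, it, did, rel = line.split()
        out.append(TrecQrel(qid, did, int(rel), it))
    return out

qrels = parse_trec_qrels(["q1 0 Edinburgh 2", "q1 0 Glasgow 0"])
```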

Relevance levels

Rel.    Definition    Count    %
0    Not Relevant. This entity is not useful or on topic.    7.1K    62.3%
1    Not Valuable. It is useful to understand what this entity is for understanding this topic.    2.2K    19.8%
2    Somewhat Valuable. It is important to understand what this entity is for understanding this topic.    1.3K    11.1%
3    Very Valuable. It is absolutely critical to understand what this entity is for understanding this topic.    777    6.9%
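A distribution like the one above can be recomputed directly from the judgments. A sketch over invented (query_id, doc_id, relevance) tuples standing in for dataset.qrels_iter():

```python
from collections import Counter

# Invented judgments standing in for the real dataset's qrels.
qrels = [("q1", "d1", 0), ("q1", "d2", 0), ("q1", "d3", 1), ("q2", "d1", 3)]

# Count judgments per relevance level and convert to percentages.
counts = Counter(rel for _, _, rel in qrels)
total = sum(counts.values())
dist = {rel: (n, 100.0 * n / total) for rel, n in sorted(counts.items())}
```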

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export kilt/codec qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:kilt/codec')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{mackie2022codec}

Bibtex:

@inproceedings{mackie2022codec,
    title={CODEC: Complex Document and Entity Collection},
    author={Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
    booktitle={Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
    year={2022}
}

"kilt/codec/economics"

Subset of CODEC containing only topics about economics.

queries
14 queries

Language: en

Query type:
CodecQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. domain: str
  4. guidelines: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec/economics")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>

You can find more details about the Python API here.

CLI
ir_datasets export kilt/codec/economics queries
[query_id]    [query]    [domain]    [guidelines]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/economics')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
5.9M docs

Inherits docs from kilt

Language: en

Document type:
KiltDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. text_pieces: Tuple[str, ...]
  5. anchors: Tuple[
    KiltDocAnchor: (namedtuple)
    1. text: str
    2. href: str
    3. paragraph_id: int
    4. start: int
    5. end: int
    , ...]
  6. categories: Tuple[str, ...]
  7. wikidata_id: str
  8. history_revid: str
  9. history_timestamp: str
  10. history_parentid: str
  11. history_pageid: str
  12. history_url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec/economics")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>

You can find more details about the Python API here.

CLI
ir_datasets export kilt/codec/economics docs
[doc_id]    [title]    [text]    [text_pieces]    [anchors]    [categories]    [wikidata_id]    [history_revid]    [history_timestamp]    [history_parentid]    [history_pageid]    [history_url]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/economics')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])

You can find more details about PyTerrier indexing here.

qrels
2.0K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.    Definition    Count    %
0    Not Relevant. Not useful or on topic.    660    33.5%
1    Not Valuable. Consists of definitions or background.    693    35.2%
2    Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge.    458    23.2%
3    Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background.    159    8.1%
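Since MAP treats judgments as binary while these levels are graded, a relevance threshold has to be chosen; ir_measures (used by PyTerrier's evaluation) defaults to counting rel >= 1 as relevant, though a stricter cutoff may suit this scale, where level 1 is labelled Not Valuable. A minimal sketch of the mapping:

```python
def binarize(rel, threshold=1):
    """Collapse a graded relevance level to binary (1 = relevant)."""
    return 1 if rel >= threshold else 0

graded = [0, 1, 2, 3]
lenient = [binarize(r) for r in graded]                # rel >= 1 is relevant
strict = [binarize(r, threshold=2) for r in graded]    # rel >= 2 is relevant
```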

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec/economics")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export kilt/codec/economics qrels --format tsv
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/economics')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{mackie2022codec}

Bibtex:

@inproceedings{mackie2022codec,
    title={CODEC: Complex Document and Entity Collection},
    author={Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
    booktitle={Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
    year={2022}
}

"kilt/codec/history"

Subset of CODEC containing only topics about history.

queries
14 queries

Language: en

Query type:
CodecQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. domain: str
  4. guidelines: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec/history")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>

You can find more details about the Python API here.

CLI
ir_datasets export kilt/codec/history queries
[query_id]    [query]    [domain]    [guidelines]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/history')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
5.9M docs

Inherits docs from kilt

Language: en

Document type:
KiltDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. text_pieces: Tuple[str, ...]
  5. anchors: Tuple[
    KiltDocAnchor: (namedtuple)
    1. text: str
    2. href: str
    3. paragraph_id: int
    4. start: int
    5. end: int
    , ...]
  6. categories: Tuple[str, ...]
  7. wikidata_id: str
  8. history_revid: str
  9. history_timestamp: str
  10. history_parentid: str
  11. history_pageid: str
  12. history_url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec/history")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>

You can find more details about the Python API here.

CLI
ir_datasets export kilt/codec/history docs
[doc_id]    [title]    [text]    [text_pieces]    [anchors]    [categories]    [wikidata_id]    [history_revid]    [history_timestamp]    [history_parentid]    [history_pageid]    [history_url]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/history')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])

You can find more details about PyTerrier indexing here.

qrels
2.0K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.    Definition    Count    %
0    Not Relevant. Not useful or on topic.    998    49.3%
1    Not Valuable. Consists of definitions or background.    618    30.5%
2    Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge.    292    14.4%
3    Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background.    116    5.7%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec/history")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export kilt/codec/history qrels --format tsv
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/history')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{mackie2022codec}

Bibtex:

@inproceedings{mackie2022codec,
    title={CODEC: Complex Document and Entity Collection},
    author={Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
    booktitle={Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
    year={2022}
}

"kilt/codec/politics"

Subset of CODEC containing only topics about politics.

queries
14 queries

Language: en

Query type:
CodecQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. domain: str
  4. guidelines: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec/politics")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>

You can find more details about the Python API here.

CLI
ir_datasets export kilt/codec/politics queries
[query_id]    [query]    [domain]    [guidelines]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/politics')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
5.9M docs

Inherits docs from kilt

Language: en

Document type:
KiltDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. text_pieces: Tuple[str, ...]
  5. anchors: Tuple[
    KiltDocAnchor: (namedtuple)
    1. text: str
    2. href: str
    3. paragraph_id: int
    4. start: int
    5. end: int
    , ...]
  6. categories: Tuple[str, ...]
  7. wikidata_id: str
  8. history_revid: str
  9. history_timestamp: str
  10. history_parentid: str
  11. history_pageid: str
  12. history_url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec/politics")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>

You can find more details about the Python API here.

CLI
ir_datasets export kilt/codec/politics docs
[doc_id]    [title]    [text]    [text_pieces]    [anchors]    [categories]    [wikidata_id]    [history_revid]    [history_timestamp]    [history_parentid]    [history_pageid]    [history_url]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/politics')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])

You can find more details about PyTerrier indexing here.

qrels
2.2K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.    Definition    Count    %
0    Not Relevant. Not useful or on topic.    695    31.7%
1    Not Valuable. Consists of definitions or background.    899    41.0%
2    Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge.    457    20.8%
3    Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background.    141    6.4%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec/politics")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export kilt/codec/politics qrels --format tsv
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/politics')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{mackie2022codec}

Bibtex:

@inproceedings{mackie2022codec,
    title={CODEC: Complex Document and Entity Collection},
    author={Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
    booktitle={Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
    year={2022}
}