Github: datasets/kilt.py

ir_datasets: KILT

Index
  1. kilt
  2. kilt/codec
  3. kilt/codec/economics
  4. kilt/codec/history
  5. kilt/codec/politics

"kilt"

KILT (Knowledge Intensive Language Tasks) is a Wikipedia-based corpus used for a variety of knowledge-intensive language tasks.

docs
5.9M docs

Language: en

Document type:
KiltDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. text_pieces: Tuple[str, ...]
  5. anchors: Tuple[
    KiltDocAnchor: (namedtuple)
    1. text: str
    2. href: str
    3. paragraph_id: int
    4. start: int
    5. end: int
    , ...]
  6. categories: Tuple[str, ...]
  7. wikidata_id: str
  8. history_revid: str
  9. history_timestamp: str
  10. history_parentid: str
  11. history_pageid: str
  12. history_url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>

You can find more details about the Python API here.
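Each KiltDoc also exposes its nested fields (text_pieces, anchors, categories) directly. The sketch below is a minimal example that fetches a single document via the docs_store() random-access API and prints its outgoing anchor links; the doc ID shown is a placeholder, and the reading of start/end as offsets within the paragraph given by paragraph_id is an assumption.

import ir_datasets

dataset = ir_datasets.load("kilt")
docs_store = dataset.docs_store()  # random access to documents by doc_id

doc = docs_store.get("some_doc_id")  # placeholder doc_id; substitute a real ID from the corpus
print(doc.title, doc.wikidata_id)
print(len(doc.text_pieces), "paragraphs")
for anchor in doc.anchors:
    # each anchor records its link text, target page, and position within a paragraph
    print(anchor.paragraph_id, anchor.start, anchor.end, anchor.text, "->", anchor.href)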

CLI
ir_datasets export kilt docs
[doc_id]    [title]    [text]    [text_pieces]    [anchors]    [categories]    [wikidata_id]    [history_revid]    [history_timestamp]    [history_parentid]    [history_pageid]    [history_url]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])

You can find more details about PyTerrier indexing here.

Citation

ir_datasets.bib:

\cite{petroni-etal-2021-kilt}

Bibtex:

@inproceedings{petroni-etal-2021-kilt,
  title = "{KILT}: a Benchmark for Knowledge Intensive Language Tasks",
  author = {Petroni, Fabio and Piktus, Aleksandra and Fan, Angela and Lewis, Patrick and Yazdani, Majid and De Cao, Nicola and Thorne, James and Jernite, Yacine and Karpukhin, Vladimir and Maillard, Jean and Plachouras, Vassilis and Rockt{\"a}schel, Tim and Riedel, Sebastian},
  booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
  month = "jun",
  year = "2021",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.naacl-main.200",
  doi = "10.18653/v1/2021.naacl-main.200",
  pages = "2523--2544",
}

"kilt/codec"

The Entity Ranking sub-task of CODEC, which uses the KILT Wikipedia corpus as its entity collection.

queries
36 queries

Language: en

Query type:
CodecQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, narrative>

You can find more details about the Python API here.
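Since CodecQuery is a namedtuple, the topics are easy to reshape into plain Python structures. The short sketch below builds a dict keyed by query_id; the layout is just one convenient choice for illustration, not part of the API.

import ir_datasets

dataset = ir_datasets.load("kilt/codec")
# map each query_id to its short query text and longer narrative
topics = {q.query_id: {"query": q.query, "narrative": q.narrative} for q in dataset.queries_iter()}
for query_id, topic in topics.items():
    print(query_id, topic["query"])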

CLI
ir_datasets export kilt/codec queries
[query_id]    [query]    [narrative]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
5.9M docs

Inherits docs from kilt

Language: en

Document type:
KiltDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. text_pieces: Tuple[str, ...]
  5. anchors: Tuple[
    KiltDocAnchor: (namedtuple)
    1. text: str
    2. href: str
    3. paragraph_id: int
    4. start: int
    5. end: int
    , ...]
  6. categories: Tuple[str, ...]
  7. wikidata_id: str
  8. history_revid: str
  9. history_timestamp: str
  10. history_parentid: str
  11. history_pageid: str
  12. history_url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>

You can find more details about the Python API here.

CLI
ir_datasets export kilt/codec docs
[doc_id]    [title]    [text]    [text_pieces]    [anchors]    [categories]    [wikidata_id]    [history_revid]    [history_timestamp]    [history_parentid]    [history_pageid]    [history_url]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])

You can find more details about PyTerrier indexing here.

qrels
10K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Not Relevant. This entity is not useful or on topic.    6.7K    65.1%
1    Not Valuable. It is useful to understand what this entity is for understanding this topic.    1.9K    18.7%
2    Somewhat Valuable. It is important to understand what this entity is for understanding this topic.    1.0K    10.1%
3    Very Valuable. It is absolutely critical to understand what this entity is for understanding this topic.    626    6.1%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
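Because these qrels are graded (0-3), they can be passed straight to an evaluation library that supports graded judgments. The sketch below uses ir_measures, a companion package to ir_datasets; run_by_query is a placeholder for your own system's output in {query_id: {doc_id: score}} form.

import ir_datasets
import ir_measures
from ir_measures import AP, nDCG

dataset = ir_datasets.load("kilt/codec")

# placeholder run: replace with your system's scores, keyed as {query_id: {doc_id: score}}
run_by_query = {}

# qrels_iter() yields namedtuples that ir_measures accepts directly
print(ir_measures.calc_aggregate([AP, nDCG@20], dataset.qrels_iter(), run_by_query))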

CLI
ir_datasets export kilt/codec qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:kilt/codec')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.


"kilt/codec/economics"

Subset of kilt/codec containing only topics about economics.

queries
12 queries

Language: en

Query type:
CodecQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec/economics")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, narrative>

You can find more details about the Python API here.

CLI
ir_datasets export kilt/codec/economics queries
[query_id]    [query]    [narrative]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/economics')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
5.9M docs

Inherits docs from kilt

Language: en

Document type:
KiltDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. text_pieces: Tuple[str, ...]
  5. anchors: Tuple[
    KiltDocAnchor: (namedtuple)
    1. text: str
    2. href: str
    3. paragraph_id: int
    4. start: int
    5. end: int
    , ...]
  6. categories: Tuple[str, ...]
  7. wikidata_id: str
  8. history_revid: str
  9. history_timestamp: str
  10. history_parentid: str
  11. history_pageid: str
  12. history_url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec/economics")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>

You can find more details about the Python API here.

CLI
ir_datasets export kilt/codec/economics docs
[doc_id]    [title]    [text]    [text_pieces]    [anchors]    [categories]    [wikidata_id]    [history_revid]    [history_timestamp]    [history_parentid]    [history_pageid]    [history_url]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/economics')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])

You can find more details about PyTerrier indexing here.

qrels
1.6K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.    Definition    Count    %
0    Not Relevant. Not useful or on topic.    596    37.4%
1    Not Valuable. Consists of definitions or background.    545    34.2%
2    Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge.    330    20.7%
3    Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background.    121    7.6%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec/economics")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export kilt/codec/economics qrels --format tsv
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/economics')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.


"kilt/codec/history"

Subset of kilt/codec containing only topics about history.

queries
12 queries

Language: en

Query type:
CodecQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec/history")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, narrative>

You can find more details about the Python API here.

CLI
ir_datasets export kilt/codec/history queries
[query_id]    [query]    [narrative]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/history')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
5.9M docs

Inherits docs from kilt

Language: en

Document type:
KiltDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. text_pieces: Tuple[str, ...]
  5. anchors: Tuple[
    KiltDocAnchor: (namedtuple)
    1. text: str
    2. href: str
    3. paragraph_id: int
    4. start: int
    5. end: int
    , ...]
  6. categories: Tuple[str, ...]
  7. wikidata_id: str
  8. history_revid: str
  9. history_timestamp: str
  10. history_parentid: str
  11. history_pageid: str
  12. history_url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec/history")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>

You can find more details about the Python API here.

CLI
ir_datasets export kilt/codec/history docs
[doc_id]    [title]    [text]    [text_pieces]    [anchors]    [categories]    [wikidata_id]    [history_revid]    [history_timestamp]    [history_parentid]    [history_pageid]    [history_url]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/history')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])

You can find more details about PyTerrier indexing here.

qrels
1.7K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.    Definition    Count    %
0    Not Relevant. Not useful or on topic.    870    51.3%
1    Not Valuable. Consists of definitions or background.    509    30.0%
2    Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge.    235    13.9%
3    Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background.    81    4.8%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec/history")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export kilt/codec/history qrels --format tsv
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/history')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.


"kilt/codec/politics"

Subset of kilt/codec containing only topics about politics.

queries
12 queries

Language: en

Query type:
CodecQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec/politics")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, narrative>

You can find more details about the Python API here.

CLI
ir_datasets export kilt/codec/politics queries
[query_id]    [query]    [narrative]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/politics')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
5.9M docs

Inherits docs from kilt

Language: en

Document type:
KiltDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. text_pieces: Tuple[str, ...]
  5. anchors: Tuple[
    KiltDocAnchor: (namedtuple)
    1. text: str
    2. href: str
    3. paragraph_id: int
    4. start: int
    5. end: int
    , ...]
  6. categories: Tuple[str, ...]
  7. wikidata_id: str
  8. history_revid: str
  9. history_timestamp: str
  10. history_parentid: str
  11. history_pageid: str
  12. history_url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec/politics")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>

You can find more details about the Python API here.

CLI
ir_datasets export kilt/codec/politics docs
[doc_id]    [title]    [text]    [text_pieces]    [anchors]    [categories]    [wikidata_id]    [history_revid]    [history_timestamp]    [history_parentid]    [history_pageid]    [history_url]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/politics')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])

You can find more details about PyTerrier indexing here.

qrels
1.8K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.    Definition    Count    %
0    Not Relevant. Not useful or on topic.    609    33.0%
1    Not Valuable. Consists of definitions or background.    765    41.5%
2    Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge.    359    19.5%
3    Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background.    110    6.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("kilt/codec/politics")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export kilt/codec/politics qrels --format tsv
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/politics')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.
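Because the three topic subsets all inherit the same KILT corpus, a single pipeline built over the shared index can be evaluated on each of them in a loop. A minimal sketch, reusing the index path and BM25 pipeline from the examples above:

import pyterrier as pt
from pyterrier.measures import *
pt.init()
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# run the same experiment on each topic subset
for subset in ['economics', 'history', 'politics']:
    dataset = pt.get_dataset(f'irds:kilt/codec/{subset}')
    print(subset)
    print(pt.Experiment(
        [pipeline],
        dataset.get_topics(),
        dataset.get_qrels(),
        [MAP, nDCG@20]
    ))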
