ir_datasets : KILT

import ir_datasets
dataset = ir_datasets.load("kilt")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>

You can find more details about the Python API here.

CLI

ir_datasets export kilt docs



[doc_id]    [title]    [text]    [text_pieces]    [anchors]    [categories]    [wikidata_id]    [history_revid]    [history_timestamp]    [history_parentid]    [history_pageid]    [history_url]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])

You can find more details about PyTerrier indexing here.
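
The `fields` list passed to `indexer.index` names keys of the per-document records that the corpus iterator yields. As a rough illustration of that shape, the sketch below flattens a namedtuple record into a dict. `SimpleDoc`, its sample values, and `to_indexable` are hypothetical stand-ins, and renaming `doc_id` to `docno` is an assumption about the PyTerrier convention, not something this page states.

```python
from typing import Dict, NamedTuple


class SimpleDoc(NamedTuple):
    """Hypothetical stand-in carrying a few of the documented KiltDoc fields."""
    doc_id: str
    title: str
    text: str
    wikidata_id: str


def to_indexable(doc: SimpleDoc) -> Dict[str, str]:
    """Turn a namedtuple record into a flat dict keyed by field name."""
    record = dict(doc._asdict())
    # Assumption: the indexer identifies documents by a 'docno' key.
    record["docno"] = record.pop("doc_id")
    return record


# Made-up values, not a real KILT document.
sample = SimpleDoc("0", "Example title", "Example body text.", "Q0")
indexable = to_indexable(sample)
```

Only string-valued fields are kept in this sketch, which matches the `fields` list above: it omits the tuple-valued `text_pieces`, `anchors`, and `categories`.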

\cite{petroni-etal-2021-kilt}

Bibtex:

@inproceedings{petroni-etal-2021-kilt,
    title = "{KILT}: a Benchmark for Knowledge Intensive Language Tasks",
    author = {Petroni, Fabio and Piktus, Aleksandra and Fan, Angela and Lewis, Patrick and Yazdani, Majid and De Cao, Nicola and Thorne, James and Jernite, Yacine and Karpukhin, Vladimir and Maillard, Jean and Plachouras, Vassilis and Rockt{\"a}schel, Tim and Riedel, Sebastian},
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = "jun",
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.naacl-main.200",
    doi = "10.18653/v1/2021.naacl-main.200",
    pages = "2523--2544",
}

{
  "docs": {
    "count": 5903530,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  }
}

`"kilt/codec"`

CODEC Entity Ranking sub-task.

Task Repository
See also: codec, the document ranking subtask

42 queries

Language: en

Query type:

CodecQuery: (namedtuple)

query_id: str
query: str
domain: str
guidelines: str

Examples:

import ir_datasets
dataset = ir_datasets.load("kilt/codec")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>

You can find more details about the Python API here.

CLI

ir_datasets export kilt/codec queries



[query_id]    [query]    [domain]    [guidelines]
...

You can find more details about the CLI here.
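
The exported query rows can be read back into records with the four documented fields. This is a minimal sketch, assuming the export is tab-separated with one query per line and no embedded tabs or newlines. `QueryRow`, `parse_query_export`, and the sample rows are hypothetical and not part of ir_datasets.

```python
import csv
import io
from typing import List, NamedTuple


class QueryRow(NamedTuple):
    """Mirrors the four documented CodecQuery fields."""
    query_id: str
    query: str
    domain: str
    guidelines: str


def parse_query_export(tsv_text: str) -> List[QueryRow]:
    """Parse tab-separated export lines into QueryRow records."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    # Skip empty lines; each remaining row must have exactly four columns.
    return [QueryRow(*row) for row in reader if row]


# Made-up sample rows, not real CODEC topics.
sample_export = (
    "q1\tfirst example question\tdomain-a\tfirst example guidelines\n"
    "q2\tsecond example question\tdomain-b\tsecond example guidelines\n"
)
rows = parse_query_export(sample_export)
```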

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

5.9M docs

Inherits docs from kilt

Language: en

Document type:

KiltDoc: (namedtuple)

doc_id: str
title: str
text: str
text_pieces: Tuple[str, ...]
anchors: Tuple[
KiltDocAnchor: (namedtuple)
1. text: str
2. href: str
3. paragraph_id: int
4. start: int
5. end: int
, ...]
categories: Tuple[str, ...]
wikidata_id: str
history_revid: str
history_timestamp: str
history_parentid: str
history_pageid: str
history_url: str

Examples:

import ir_datasets
dataset = ir_datasets.load("kilt/codec")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>

You can find more details about the Python API here.
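
Each entry in `anchors` carries a `paragraph_id` together with `start` and `end` positions. The sketch below shows one plausible way to recover the linked span. It assumes that `paragraph_id` indexes into `text_pieces` and that `start`/`end` are character offsets with `end` exclusive; this page does not confirm that interpretation. `Anchor`, `anchor_surface`, and the sample content are hypothetical.

```python
from typing import NamedTuple, Tuple


class Anchor(NamedTuple):
    """Mirrors the documented KiltDocAnchor fields."""
    text: str
    href: str
    paragraph_id: int
    start: int
    end: int


def anchor_surface(text_pieces: Tuple[str, ...], anchor: Anchor) -> str:
    """Return the span an anchor points at.

    Assumption: start/end are character offsets into
    text_pieces[paragraph_id], with end exclusive.
    """
    return text_pieces[anchor.paragraph_id][anchor.start:anchor.end]


# Made-up document content for illustration only.
pieces = ("Example title\n", "This paragraph links to another page.\n")
link = Anchor(text="another page", href="Another_page", paragraph_id=1, start=24, end=36)
span = anchor_surface(pieces, link)
```

If the assumption holds for real records, `span` should equal the anchor's own `text` field, which makes for a cheap consistency check.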

CLI

ir_datasets export kilt/codec docs



[doc_id]    [title]    [text]    [text_pieces]    [anchors]    [categories]    [wikidata_id]    [history_revid]    [history_timestamp]    [history_parentid]    [history_pageid]    [history_url]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])

You can find more details about PyTerrier indexing here.

qrels

11K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not Relevant. This entity is not useful or on topic.	`7.1K`	62.3%
1	Not Valuable. It is useful to understand what this entity is for understanding this topic.	`2.2K`	19.8%
2	Somewhat Valuable. It is important to understand what this entity is for understanding this topic.	`1.3K`	11.1%
3	Very Valuable. It is absolutely critical to understand what this entity is for understanding this topic.	`777`	6.9%

Examples:

import ir_datasets
dataset = ir_datasets.load("kilt/codec")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
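
Evaluation code usually needs the judgments grouped by query instead of as a flat stream. The sketch below builds that lookup from records shaped like the documented `TrecQrel`. The `Qrel` stand-in, `group_qrels`, and the sample judgments are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Qrel(NamedTuple):
    """Mirrors the documented TrecQrel fields."""
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def group_qrels(qrels: Iterable[Qrel]) -> Dict[str, Dict[str, int]]:
    """Build a query_id -> {doc_id: relevance} lookup from a flat qrels stream."""
    grouped: Dict[str, Dict[str, int]] = defaultdict(dict)
    for qrel in qrels:
        grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


# Made-up judgments, not real CODEC qrels.
sample_qrels = [
    Qrel("q1", "d1", 3, "0"),
    Qrel("q1", "d2", 0, "0"),
    Qrel("q2", "d1", 1, "0"),
]
judgments = group_qrels(sample_qrels)
```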

CLI

ir_datasets export kilt/codec qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:kilt/codec')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.
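
To make the `nDCG@20` measure above concrete, here is a minimal per-query sketch that uses the graded relevance levels directly. It assumes linear gain and a log2(rank + 1) discount, which is one common variant; the implementation PyTerrier calls may differ in detail. The function names and sample judgments are hypothetical.

```python
import math
from typing import Dict, Sequence


def dcg(gains: Sequence[int]) -> float:
    """Discounted cumulative gain with linear gain and a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranking: Sequence[str], judged: Dict[str, int], k: int) -> float:
    """nDCG@k for one query; unjudged documents count as relevance 0."""
    gains = [judged.get(doc_id, 0) for doc_id in ranking[:k]]
    # The ideal ranking orders all judged documents by descending relevance.
    ideal = sorted(judged.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Made-up judgments and ranking for a single query.
judged = {"d1": 3, "d2": 0, "d3": 2, "d4": 1}
ranking = ["d1", "d2", "d3"]
score = ndcg_at_k(ranking, judged, k=20)
```

Because the gain is the relevance level itself, a "Very Valuable" (3) result contributes three times as much as a level-1 result at the same rank.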

\cite{mackie2022codec}

Bibtex:

@inproceedings{mackie2022codec,
    title={CODEC: Complex Document and Entity Collection},
    author={Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
    booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
    year={2022}
}

{
  "docs": {
    "count": 5903530,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 42
  },
  "qrels": {
    "count": 11323,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 7053,
          "2": 1252,
          "3": 777,
          "1": 2241
        }
      }
    }
  }
}
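
The percentages in the relevance-level table follow from the `counts_by_value` entries in this metadata. The sketch below recomputes them from the values shown above; `relevance_shares` is a hypothetical helper, and the JSON literal copies only the `qrels` part of the block.

```python
import json
from typing import Dict

# The "qrels" portion of the kilt/codec metadata shown above.
metadata_json = """
{
  "qrels": {
    "count": 11323,
    "fields": {
      "relevance": {
        "counts_by_value": {"0": 7053, "2": 1252, "3": 777, "1": 2241}
      }
    }
  }
}
"""


def relevance_shares(metadata: dict) -> Dict[int, float]:
    """Percentage of judgments at each relevance level, rounded to one decimal."""
    qrels = metadata["qrels"]
    total = qrels["count"]
    counts = qrels["fields"]["relevance"]["counts_by_value"]
    return {int(level): round(100 * n / total, 1) for level, n in sorted(counts.items())}


shares = relevance_shares(json.loads(metadata_json))
# shares == {0: 62.3, 1: 19.8, 2: 11.1, 3: 6.9}
```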

`"kilt/codec/economics"`

Subset of codec that only contains topics about economics.

14 queries

Language: en

Query type:

CodecQuery: (namedtuple)

query_id: str
query: str
domain: str
guidelines: str

Examples:

import ir_datasets
dataset = ir_datasets.load("kilt/codec/economics")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>

You can find more details about the Python API here.
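
Since every query record carries a `domain` field, a comparable selection can be sketched by filtering the full topic set. The exact strings stored in `domain` are not shown on this page, so the value `"economics"` below is an assumption, and `Query`, `queries_in_domain`, and the sample records are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Query(NamedTuple):
    """Mirrors the documented CodecQuery fields."""
    query_id: str
    query: str
    domain: str
    guidelines: str


def queries_in_domain(queries: Iterable[Query], domain: str) -> List[Query]:
    """Keep only the queries whose domain field matches exactly."""
    return [q for q in queries if q.domain == domain]


# Made-up records; the domain labels are assumed, not taken from the data.
all_queries = [
    Query("q1", "example economics question", "economics", "example guidelines"),
    Query("q2", "example history question", "history", "example guidelines"),
    Query("q3", "another economics question", "economics", "example guidelines"),
]
economics_only = queries_in_domain(all_queries, "economics")
```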

CLI

ir_datasets export kilt/codec/economics queries



[query_id]    [query]    [domain]    [guidelines]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/economics')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

5.9M docs

Inherits docs from kilt

Language: en

Document type:

KiltDoc: (namedtuple)

doc_id: str
title: str
text: str
text_pieces: Tuple[str, ...]
anchors: Tuple[
KiltDocAnchor: (namedtuple)
1. text: str
2. href: str
3. paragraph_id: int
4. start: int
5. end: int
, ...]
categories: Tuple[str, ...]
wikidata_id: str
history_revid: str
history_timestamp: str
history_parentid: str
history_pageid: str
history_url: str

Examples:

import ir_datasets
dataset = ir_datasets.load("kilt/codec/economics")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>

You can find more details about the Python API here.

CLI

ir_datasets export kilt/codec/economics docs



[doc_id]    [title]    [text]    [text_pieces]    [anchors]    [categories]    [wikidata_id]    [history_revid]    [history_timestamp]    [history_parentid]    [history_pageid]    [history_url]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/economics')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])

You can find more details about PyTerrier indexing here.

qrels

2.0K qrels

Query relevance judgment type:

GenericQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int

Relevance levels

Rel.	Definition	Count	%
0	Not Relevant. Not useful or on topic.	`660`	33.5%
1	Not Valuable. Consists of definitions or background.	`693`	35.2%
2	Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge.	`458`	23.2%
3	Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background.	`159`	8.1%

Examples:

import ir_datasets
dataset = ir_datasets.load("kilt/codec/economics")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI

ir_datasets export kilt/codec/economics qrels --format tsv



[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/economics')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

\cite{mackie2022codec}


{
  "docs": {
    "count": 5903530,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 14
  },
  "qrels": {
    "count": 1970,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 458,
          "0": 660,
          "1": 693,
          "3": 159
        }
      }
    }
  }
}

`"kilt/codec/history"`

Subset of codec that only contains topics about history.

14 queries

Language: en

Query type:

CodecQuery: (namedtuple)

query_id: str
query: str
domain: str
guidelines: str

Examples:

import ir_datasets
dataset = ir_datasets.load("kilt/codec/history")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>

You can find more details about the Python API here.

CLI

ir_datasets export kilt/codec/history queries



[query_id]    [query]    [domain]    [guidelines]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/history')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

5.9M docs

Inherits docs from kilt

Language: en

Document type:

KiltDoc: (namedtuple)

doc_id: str
title: str
text: str
text_pieces: Tuple[str, ...]
anchors: Tuple[
KiltDocAnchor: (namedtuple)
1. text: str
2. href: str
3. paragraph_id: int
4. start: int
5. end: int
, ...]
categories: Tuple[str, ...]
wikidata_id: str
history_revid: str
history_timestamp: str
history_parentid: str
history_pageid: str
history_url: str

Examples:

import ir_datasets
dataset = ir_datasets.load("kilt/codec/history")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>

You can find more details about the Python API here.

CLI

ir_datasets export kilt/codec/history docs



[doc_id]    [title]    [text]    [text_pieces]    [anchors]    [categories]    [wikidata_id]    [history_revid]    [history_timestamp]    [history_parentid]    [history_pageid]    [history_url]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/history')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])

You can find more details about PyTerrier indexing here.

qrels

2.0K qrels

Query relevance judgment type:

GenericQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int

Relevance levels

Rel.	Definition	Count	%
0	Not Relevant. Not useful or on topic.	`998`	49.3%
1	Not Valuable. Consists of definitions or background.	`618`	30.5%
2	Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge.	`292`	14.4%
3	Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background.	`116`	5.7%

Examples:

import ir_datasets
dataset = ir_datasets.load("kilt/codec/history")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI

ir_datasets export kilt/codec/history qrels --format tsv



[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/history')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.
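
`MAP` is a binary measure, so the graded levels have to be cut at some threshold before it can be computed. Level 1 here is labelled "Not Valuable", which makes the choice of cut matter. The sketch below computes average precision for one query with an explicit threshold. Which threshold the `MAP` measure above applies is not stated on this page, so the default of `min_rel=1` is an assumption, and all names and sample data are hypothetical.

```python
from typing import Dict, Sequence


def average_precision(ranking: Sequence[str], judged: Dict[str, int], min_rel: int = 1) -> float:
    """Average precision for one query, treating relevance >= min_rel as relevant."""
    relevant = {doc_id for doc_id, rel in judged.items() if rel >= min_rel}
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    # Relevant documents that were never retrieved still count in the denominator.
    return precision_sum / len(relevant)


# Made-up judgments and ranking for a single query.
judged = {"d1": 2, "d2": 0, "d3": 1, "d4": 3, "d5": 2}
ranking = ["d1", "d2", "d3", "d4"]
ap_any = average_precision(ranking, judged, min_rel=1)
ap_strict = average_precision(ranking, judged, min_rel=2)
```

Averaging this value over all queries gives MAP. Raising `min_rel` to 2 restricts "relevant" to the Somewhat Valuable and Very Valuable levels.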

\cite{mackie2022codec}


{
  "docs": {
    "count": 5903530,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 14
  },
  "qrels": {
    "count": 2024,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 998,
          "1": 618,
          "2": 292,
          "3": 116
        }
      }
    }
  }
}

`"kilt/codec/politics"`

Subset of codec that only contains topics about politics.

14 queries

Language: en

Query type:

CodecQuery: (namedtuple)

query_id: str
query: str
domain: str
guidelines: str

Examples:

import ir_datasets
dataset = ir_datasets.load("kilt/codec/politics")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>

You can find more details about the Python API here.

CLI

ir_datasets export kilt/codec/politics queries



[query_id]    [query]    [domain]    [guidelines]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/politics')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

5.9M docs

Inherits docs from kilt

Language: en

Document type:

KiltDoc: (namedtuple)

doc_id: str
title: str
text: str
text_pieces: Tuple[str, ...]
anchors: Tuple[
KiltDocAnchor: (namedtuple)
1. text: str
2. href: str
3. paragraph_id: int
4. start: int
5. end: int
, ...]
categories: Tuple[str, ...]
wikidata_id: str
history_revid: str
history_timestamp: str
history_parentid: str
history_pageid: str
history_url: str

Examples:

import ir_datasets
dataset = ir_datasets.load("kilt/codec/politics")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, text_pieces, anchors, categories, wikidata_id, history_revid, history_timestamp, history_parentid, history_pageid, history_url>

You can find more details about the Python API here.

CLI

ir_datasets export kilt/codec/politics docs



[doc_id]    [title]    [text]    [text_pieces]    [anchors]    [categories]    [wikidata_id]    [history_revid]    [history_timestamp]    [history_parentid]    [history_pageid]    [history_url]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/politics')
# Index kilt
indexer = pt.IterDictIndexer('./indices/kilt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'wikidata_id', 'history_revid', 'history_timestamp', 'history_parentid', 'history_pageid', 'history_url'])

You can find more details about PyTerrier indexing here.

qrels

2.2K qrels

Query relevance judgment type:

GenericQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int

Relevance levels

Rel.	Definition	Count	%
0	Not Relevant. Not useful or on topic.	`695`	31.7%
1	Not Valuable. Consists of definitions or background.	`899`	41.0%
2	Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge.	`457`	20.8%
3	Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background.	`141`	6.4%

Examples:

import ir_datasets
dataset = ir_datasets.load("kilt/codec/politics")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI

ir_datasets export kilt/codec/politics qrels --format tsv



[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:kilt/codec/politics')
index_ref = pt.IndexRef.of('./indices/kilt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

\cite{mackie2022codec}

Bibtex: