Github: datasets/codec.py

ir_datasets: CODEC

Index
  1. codec
  2. codec/economics
  3. codec/history
  4. codec/politics

Data Access Information

To use this dataset, you need a copy of the document corpus.

Obtaining it involves emailing a dataset author, who will provide instructions for downloading the dataset.

ir_datasets expects the source file to be copied/linked under ~/.ir_datasets/codec/v1/comets_documents.jsonl.
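
Once the file has been downloaded, linking it into place could look like the sketch below. The helper name link_corpus and the download location are hypothetical; only the target path under the ir_datasets home directory comes from the text above. The demonstration uses a temporary directory rather than the real ~/.ir_datasets.

```python
import tempfile
from pathlib import Path


def link_corpus(source: Path, home: Path) -> Path:
    """Symlink `source` to <home>/codec/v1/comets_documents.jsonl (hypothetical helper)."""
    target = home / "codec" / "v1" / "comets_documents.jsonl"
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.is_symlink():
        target.unlink()  # drop a stale link; a real file is left untouched
    if not target.exists():
        target.symlink_to(source.resolve())
    return target


# Demonstration with throwaway paths; the file content is made up.
with tempfile.TemporaryDirectory() as tmp:
    downloaded = Path(tmp) / "download.jsonl"  # stands in for the file you received
    downloaded.write_text('{"placeholder": true}\n')
    linked = link_corpus(downloaded, Path(tmp) / ".ir_datasets")
    demo_ok = linked.read_text() == downloaded.read_text()
```

For a real setup, the call would be link_corpus(Path("/path/to/your/download.jsonl"), Path.home() / ".ir_datasets"), with the first argument replaced by wherever the file was saved.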


"codec"

CODEC Document Ranking sub-task.

  • Documents: curated web articles
  • Queries: challenging, entity-focused queries
  • Task Repository
  • See also: kilt/codec, the entity ranking subtask
queries
42 queries

Language: en

Query type:
CodecQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. domain: str
  4. guidelines: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>

You can find more details about the Python API here.
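
Because each query carries a domain field, the queries can be grouped by domain on the client side. The sketch below uses a stand-in namedtuple with the same four fields so that it runs without the corpus; the sample queries and the domain strings are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict, namedtuple

# Stand-in with the same fields as CodecQuery (assumption: used only for this demo)
Query = namedtuple("Query", ["query_id", "query", "domain", "guidelines"])


def group_by_domain(queries):
    """Map each domain string to the list of query_ids seen for it."""
    groups = defaultdict(list)
    for q in queries:
        groups[q.domain].append(q.query_id)
    return dict(groups)


# Made-up queries; real ones would come from dataset.queries_iter()
sample = [
    Query("q1", "example economics question", "economics", "..."),
    Query("q2", "example history question", "history", "..."),
    Query("q3", "another economics question", "economics", "..."),
]
print(group_by_domain(sample))
# prints {'economics': ['q1', 'q3'], 'history': ['q2']}
```

With the real dataset, passing dataset.queries_iter() to group_by_domain should work the same way, since the function only reads the domain and query_id attributes.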

CLI
ir_datasets export codec queries
[query_id]    [query]    [domain]    [guidelines]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
730K docs

Language: en

Document type:
CodecDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url>

You can find more details about the Python API here.

CLI
ir_datasets export codec docs
[doc_id]    [title]    [text]    [url]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec')
# Index codec
indexer = pt.IterDictIndexer('./indices/codec', meta={"docno": 32})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'url'])

You can find more details about PyTerrier indexing here.
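
The meta={"docno": 32} argument above reserves 32 characters for each document identifier. A cheap guard before indexing is to check identifier lengths while converting documents to dictionaries. The sketch below is an illustration only: the Doc stand-in mirrors the CodecDoc fields, the helper name to_index_records is hypothetical, and the assumption that the indexer wants a "docno" key alongside the text fields is mine rather than something stated on this page.

```python
from collections import namedtuple

# Stand-in with the same fields as CodecDoc (assumption: used only for this demo)
Doc = namedtuple("Doc", ["doc_id", "title", "text", "url"])


def to_index_records(docs, docno_len=32):
    """Yield one dict per document, failing early if a doc_id is too long."""
    for doc in docs:
        if len(doc.doc_id) > docno_len:
            raise ValueError(f"doc_id {doc.doc_id!r} exceeds {docno_len} characters")
        yield {"docno": doc.doc_id, "title": doc.title, "text": doc.text, "url": doc.url}


# Made-up document for illustration
records = list(to_index_records([Doc("d1", "A title", "Some text.", "http://example.org")]))
print(records[0]["docno"])
# prints d1
```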

qrels
6.2K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel. | Definition | Count | %
0 | Not Relevant. Not useful or on topic. | 2.4K | 38.0%
1 | Not Valuable. Consists of definitions or background. | 2.2K | 35.7%
2 | Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge. | 1.2K | 19.5%
3 | Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background. | 416 | 6.7%
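
The Count and % columns can be recomputed from any iterable of qrels. The following is a minimal sketch: the Qrel stand-in and the ten sample judgments are made up so the code runs without the dataset, and the helper name relevance_distribution is hypothetical.

```python
from collections import Counter, namedtuple

# Stand-in exposing the fields the function reads (assumption: demo only)
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance"])


def relevance_distribution(qrels):
    """Return {relevance: (count, percent)} with percent rounded to one decimal."""
    counts = Counter(q.relevance for q in qrels)
    total = sum(counts.values())
    return {rel: (n, round(100 * n / total, 1)) for rel, n in sorted(counts.items())}


# Ten made-up judgments: four 0s, three 1s, two 2s, one 3
labels = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3]
sample_qrels = [Qrel("q1", f"d{i}", rel) for i, rel in enumerate(labels)]
print(relevance_distribution(sample_qrels))
# prints {0: (4, 40.0), 1: (3, 30.0), 2: (2, 20.0), 3: (1, 10.0)}
```

The function only reads the relevance attribute, so it should accept the output of dataset.qrels_iter() unchanged.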

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export codec qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:codec')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.
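
Because the judgments are graded from 0 to 3, nDCG@20 rewards placing level-3 documents above level-1 documents. The sketch below shows one common formulation (gain equal to the relevance level, discount of log2(rank + 1)); it is meant to convey the idea, and the measure computed by PyTerrier may differ in details. The function names and the toy judgments are illustrative.

```python
import math


def dcg(gains):
    """Discounted cumulative gain for gains listed in rank order."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_doc_ids, judged, k=20):
    """judged maps doc_id to relevance; unjudged documents count as gain 0."""
    gains = [judged.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(judged.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy judgments for a single query (made up)
judged = {"a": 3, "b": 2, "c": 0}
print(ndcg_at_k(["a", "b", "c"], judged))
# prints 1.0
```

Swapping the first two documents, as in ndcg_at_k(["b", "a", "c"], judged), gives a score below 1.0 because the most valuable document is no longer ranked first.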

Citation

ir_datasets.bib:

\cite{mackie2022codec}

Bibtex:

@inproceedings{mackie2022codec,
  title={CODEC: Complex Document and Entity Collection},
  author={Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
  booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}

"codec/economics"

Subset of codec that only contains topics about economics.
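
Conceptually, a domain subset keeps the queries for one domain together with the qrels that refer to them, while the documents stay the same. The sketch below illustrates that relationship on made-up data; it is not how ir_datasets builds the subset, and the stand-in tuples, the helper name subset_view, and the domain strings are assumptions.

```python
from collections import namedtuple

# Stand-ins with the fields used below (assumption: demo only)
Query = namedtuple("Query", ["query_id", "query", "domain", "guidelines"])
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance"])


def subset_view(queries, qrels, domain):
    """Keep queries whose domain matches, plus the qrels judged for those queries."""
    kept_queries = [q for q in queries if q.domain == domain]
    kept_ids = {q.query_id for q in kept_queries}
    kept_qrels = [r for r in qrels if r.query_id in kept_ids]
    return kept_queries, kept_qrels


# Made-up data for illustration
queries = [
    Query("q1", "example economics question", "economics", "..."),
    Query("q2", "example history question", "history", "..."),
]
qrels = [Qrel("q1", "d1", 3), Qrel("q2", "d1", 0), Qrel("q1", "d2", 1)]

econ_queries, econ_qrels = subset_view(queries, qrels, "economics")
print(len(econ_queries), len(econ_qrels))
# prints 1 2
```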

queries
14 queries

Language: en

Query type:
CodecQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. domain: str
  4. guidelines: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec/economics")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>

You can find more details about the Python API here.

CLI
ir_datasets export codec/economics queries
[query_id]    [query]    [domain]    [guidelines]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec/economics')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
730K docs

Inherits docs from codec

Language: en

Document type:
CodecDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec/economics")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url>

You can find more details about the Python API here.

CLI
ir_datasets export codec/economics docs
[doc_id]    [title]    [text]    [url]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec/economics')
# Index codec
indexer = pt.IterDictIndexer('./indices/codec', meta={"docno": 32})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'url'])

You can find more details about PyTerrier indexing here.

qrels
2.0K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel. | Definition | Count | %
0 | Not Relevant. Not useful or on topic. | 660 | 33.5%
1 | Not Valuable. Consists of definitions or background. | 693 | 35.2%
2 | Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge. | 458 | 23.2%
3 | Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background. | 159 | 8.1%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec/economics")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export codec/economics qrels --format tsv
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:codec/economics')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.


"codec/history"

Subset of codec that only contains topics about history.

queries
14 queries

Language: en

Query type:
CodecQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. domain: str
  4. guidelines: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec/history")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>

You can find more details about the Python API here.

CLI
ir_datasets export codec/history queries
[query_id]    [query]    [domain]    [guidelines]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec/history')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
730K docs

Inherits docs from codec

Language: en

Document type:
CodecDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec/history")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url>

You can find more details about the Python API here.

CLI
ir_datasets export codec/history docs
[doc_id]    [title]    [text]    [url]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec/history')
# Index codec
indexer = pt.IterDictIndexer('./indices/codec', meta={"docno": 32})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'url'])

You can find more details about PyTerrier indexing here.

qrels
2.0K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel. | Definition | Count | %
0 | Not Relevant. Not useful or on topic. | 998 | 49.3%
1 | Not Valuable. Consists of definitions or background. | 618 | 30.5%
2 | Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge. | 292 | 14.4%
3 | Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background. | 116 | 5.7%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec/history")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export codec/history qrels --format tsv
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:codec/history')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.


"codec/politics"

Subset of codec that only contains topics about politics.

queries
14 queries

Language: en

Query type:
CodecQuery: (namedtuple)
  1. query_id: str
  2. query: str
  3. domain: str
  4. guidelines: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec/politics")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>

You can find more details about the Python API here.

CLI
ir_datasets export codec/politics queries
[query_id]    [query]    [domain]    [guidelines]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec/politics')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
730K docs

Inherits docs from codec

Language: en

Document type:
CodecDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec/politics")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url>

You can find more details about the Python API here.

CLI
ir_datasets export codec/politics docs
[doc_id]    [title]    [text]    [url]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec/politics')
# Index codec
indexer = pt.IterDictIndexer('./indices/codec', meta={"docno": 32})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'url'])

You can find more details about PyTerrier indexing here.

qrels
2.2K qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel. | Definition | Count | %
0 | Not Relevant. Not useful or on topic. | 695 | 31.7%
1 | Not Valuable. Consists of definitions or background. | 899 | 41.0%
2 | Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge. | 457 | 20.8%
3 | Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background. | 141 | 6.4%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("codec/politics")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export codec/politics qrels --format tsv
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:codec/politics')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.
