ir_datasets: CODEC
To use this dataset, you need a copy of the document corpus from here.
The process involves emailing a dataset author, who will provide instructions for downloading the dataset.
ir_datasets expects the source file to be copied/linked under ~/.ir_datasets/codec/v1/comets_documents.jsonl.
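Before loading the dataset, it can help to confirm that the corpus file is where ir_datasets expects it. The following is a minimal sketch using only the standard library; the path is taken from the note above, and `corpus_status` is a hypothetical helper name, not part of ir_datasets.

```python
from pathlib import Path

# Expected location, as stated in the note above.
EXPECTED_CORPUS = Path.home() / ".ir_datasets" / "codec" / "v1" / "comets_documents.jsonl"


def corpus_status(path: Path = EXPECTED_CORPUS) -> str:
    """Describe whether the corpus file is present (hypothetical helper)."""
    if path.is_file():
        # is_file() follows symlinks, so a valid link to the corpus also passes.
        return f"found: {path} ({path.stat().st_size} bytes)"
    if path.is_symlink():
        # The link exists but its target does not.
        return f"broken link: {path}"
    return f"missing: {path}"


print(corpus_status())
```

If the status is "missing" or "broken link", copy or link the downloaded file to that location before calling `ir_datasets.load("codec")`.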
CODEC Document Ranking sub-task.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("codec")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>
You can find more details about the Python API here.
ir_datasets export codec queries
[query_id]    [query]    [domain]    [guidelines]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.codec.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("codec")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url>
You can find more details about the Python API here.
ir_datasets export codec docs
[doc_id]    [title]    [text]    [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec')
# Index codec
indexer = pt.IterDictIndexer('./indices/codec', meta={"docno": 32})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'url'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.codec')
for doc in dataset.iter_documents():
    print(doc)  # a single document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Not Relevant. Not useful or on topic. | 2.4K | 38.0% | 
| 1 | Not Valuable. Consists of definitions or background. | 2.2K | 35.7% | 
| 2 | Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge. | 1.2K | 19.5% | 
| 3 | Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background. | 416 | 6.7% | 
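The graded levels in the table can be mapped to their labels and filtered by a minimum grade. This is a minimal sketch: `Qrel` is an illustrative stand-in for the records yielded by `qrels_iter()`, the sample records are hypothetical, and the threshold of 2 is only an example, not a recommendation from the dataset authors.

```python
from collections import namedtuple

# Illustrative stand-in for the qrel records; real records may carry more fields.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance"])

# Labels copied from the relevance table above.
LABELS = {
    0: "Not Relevant",
    1: "Not Valuable",
    2: "Somewhat Valuable",
    3: "Very Valuable",
}


def at_least(qrels, min_relevance):
    """Keep only judgments whose grade is at or above min_relevance."""
    return [q for q in qrels if q.relevance >= min_relevance]


# Hypothetical judgments, for illustration only.
sample = [
    Qrel("q1", "d1", 0),
    Qrel("q1", "d2", 1),
    Qrel("q1", "d3", 2),
    Qrel("q2", "d4", 3),
]

valuable = at_least(sample, 2)
for q in valuable:
    print(q.query_id, q.doc_id, LABELS[q.relevance])
```

With the real dataset, the same function should accept `dataset.qrels_iter()` in place of `sample`, since it only reads the `relevance` attribute.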
Examples:
import ir_datasets
dataset = ir_datasets.load("codec")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export codec qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
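The exported qrels can be read back with the standard library. This sketch assumes the export is tab-separated with the four columns in the order shown above; `QrelRow` and `read_qrels_tsv` are hypothetical names, and the sample rows are made up for illustration.

```python
import csv
import io
from typing import Iterator, NamedTuple


class QrelRow(NamedTuple):
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def read_qrels_tsv(handle) -> Iterator[QrelRow]:
    """Parse tab-separated qrels lines into typed rows, skipping blank lines."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue
        query_id, doc_id, relevance, iteration = row
        yield QrelRow(query_id, doc_id, int(relevance), iteration)


# Hypothetical export content; real identifiers will differ.
sample = io.StringIO("q1\td1\t2\t0\nq1\td2\t0\t0\n")
rows = list(read_qrels_tsv(sample))
print(rows)
```

To parse a real export, pass an open file handle (for example one opened with `newline=""`) instead of the `io.StringIO` sample.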
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:codec')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.codec.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{mackie2022codec,
  title={CODEC: Complex Document and Entity Collection},
  author={Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
  booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}
Metadata:
{
  "docs": {
    "count": 729824,
    "fields": {
      "doc_id": {
        "max_len": 32,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 42
  },
  "qrels": {
    "count": 6186,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 1207,
          "0": 2353,
          "1": 2210,
          "3": 416
        }
      }
    }
  }
}
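The metadata above can be cross-checked against the relevance table. The sketch below copies the qrels counts from the metadata, confirms that they sum to the stated total, and derives the percentage for each level; `relevance_percentages` is a hypothetical helper name.

```python
import json

# Values copied from the qrels part of the metadata above.
METADATA = json.loads("""
{
  "qrels": {
    "count": 6186,
    "fields": {
      "relevance": {
        "counts_by_value": {"2": 1207, "0": 2353, "1": 2210, "3": 416}
      }
    }
  }
}
""")


def relevance_percentages(metadata):
    """Return {level: percent of all judgments}, rounded to one decimal place."""
    qrels = metadata["qrels"]
    counts = qrels["fields"]["relevance"]["counts_by_value"]
    total = qrels["count"]
    return {int(level): round(100 * n / total, 1) for level, n in sorted(counts.items())}


counts = METADATA["qrels"]["fields"]["relevance"]["counts_by_value"]
percentages = relevance_percentages(METADATA)
print(sum(counts.values()) == METADATA["qrels"]["count"])
print(percentages)
```

The derived values should match the percentage column of the relevance table for the full codec dataset.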
Subset of codec that only contains topics about economics.
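Conceptually, a domain subset corresponds to selecting queries by their `domain` field. The sketch below illustrates the idea on hypothetical records; `Query` is a stand-in for the query namedtuple, and the assumption that `domain` holds lowercase strings such as "economics" is not confirmed by this page.

```python
from collections import namedtuple

# Illustrative stand-in with the same field names as the query records.
Query = namedtuple("Query", ["query_id", "query", "domain", "guidelines"])


def queries_in_domain(queries, domain):
    """Return the queries whose domain field equals the requested domain."""
    return [q for q in queries if q.domain == domain]


# Hypothetical queries; the text and domain values are assumptions.
sample = [
    Query("q1", "example economics question", "economics", "example guidelines"),
    Query("q2", "example history question", "history", "example guidelines"),
    Query("q3", "another economics question", "economics", "example guidelines"),
]

economics = queries_in_domain(sample, "economics")
print([q.query_id for q in economics])
```

In practice, loading `codec/economics` directly avoids the filtering step and also restricts the qrels to the same topics.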
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("codec/economics")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>
You can find more details about the Python API here.
ir_datasets export codec/economics queries
[query_id]    [query]    [domain]    [guidelines]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec/economics')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.codec.economics.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from codec
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("codec/economics")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url>
You can find more details about the Python API here.
ir_datasets export codec/economics docs
[doc_id]    [title]    [text]    [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec/economics')
# Index codec
indexer = pt.IterDictIndexer('./indices/codec', meta={"docno": 32})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'url'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.codec.economics')
for doc in dataset.iter_documents():
    print(doc)  # a single document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Not Relevant. Not useful or on topic. | 660 | 33.5% | 
| 1 | Not Valuable. Consists of definitions or background. | 693 | 35.2% | 
| 2 | Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge. | 458 | 23.2% | 
| 3 | Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background. | 159 | 8.1% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("codec/economics")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export codec/economics qrels --format tsv
[query_id]    [doc_id]    [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:codec/economics')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.codec.economics.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{mackie2022codec,
  title={CODEC: Complex Document and Entity Collection},
  author={Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
  booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}
Metadata:
{
  "docs": {
    "count": 729824,
    "fields": {
      "doc_id": {
        "max_len": 32,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 14
  },
  "qrels": {
    "count": 1970,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 458,
          "0": 660,
          "1": 693,
          "3": 159
        }
      }
    }
  }
}
Subset of codec that only contains topics about history.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("codec/history")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>
You can find more details about the Python API here.
ir_datasets export codec/history queries
[query_id]    [query]    [domain]    [guidelines]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec/history')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.codec.history.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from codec
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("codec/history")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url>
You can find more details about the Python API here.
ir_datasets export codec/history docs
[doc_id]    [title]    [text]    [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec/history')
# Index codec
indexer = pt.IterDictIndexer('./indices/codec', meta={"docno": 32})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'url'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.codec.history')
for doc in dataset.iter_documents():
    print(doc)  # a single document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Not Relevant. Not useful or on topic. | 998 | 49.3% | 
| 1 | Not Valuable. Consists of definitions or background. | 618 | 30.5% | 
| 2 | Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge. | 292 | 14.4% | 
| 3 | Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background. | 116 | 5.7% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("codec/history")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export codec/history qrels --format tsv
[query_id]    [doc_id]    [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:codec/history')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.codec.history.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{mackie2022codec,
  title={CODEC: Complex Document and Entity Collection},
  author={Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
  booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}
Metadata:
{
  "docs": {
    "count": 729824,
    "fields": {
      "doc_id": {
        "max_len": 32,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 14
  },
  "qrels": {
    "count": 2024,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 998,
          "1": 618,
          "2": 292,
          "3": 116
        }
      }
    }
  }
}
Subset of codec that only contains topics about politics.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("codec/politics")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, domain, guidelines>
You can find more details about the Python API here.
ir_datasets export codec/politics queries
[query_id]    [query]    [domain]    [guidelines]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec/politics')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.codec.politics.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from codec
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("codec/politics")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url>
You can find more details about the Python API here.
ir_datasets export codec/politics docs
[doc_id]    [title]    [text]    [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:codec/politics')
# Index codec
indexer = pt.IterDictIndexer('./indices/codec', meta={"docno": 32})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'text', 'url'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.codec.politics')
for doc in dataset.iter_documents():
    print(doc)  # a single document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Not Relevant. Not useful or on topic. | 695 | 31.7% | 
| 1 | Not Valuable. Consists of definitions or background. | 899 | 41.0% | 
| 2 | Somewhat Valuable. Includes valuable topic-specific arguments, evidence, or knowledge. | 457 | 20.8% | 
| 3 | Very Valuable. Includes central topic-specific arguments, evidence, or knowledge. This does not include general definitions or background. | 141 | 6.4% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("codec/politics")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export codec/politics qrels --format tsv
[query_id]    [doc_id]    [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:codec/politics')
index_ref = pt.IndexRef.of('./indices/codec') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.codec.politics.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{mackie2022codec,
  title={CODEC: Complex Document and Entity Collection},
  author={Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
  booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}
Metadata:
{
  "docs": {
    "count": 729824,
    "fields": {
      "doc_id": {
        "max_len": 32,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 14
  },
  "qrels": {
    "count": 2192,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "3": 141,
          "2": 457,
          "1": 899,
          "0": 695
        }
      }
    }
  }
}
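The query and qrels counts reported for the three subsets can be compared with those of the full codec dataset. The sketch below uses only the numbers stated in the metadata on this page; `totals` is a hypothetical helper. Matching totals are consistent with the subsets splitting the topics between them, although totals alone do not prove that no topic is shared.

```python
# Counts copied from the metadata blocks on this page.
FULL = {"queries": 42, "qrels": 6186}
SUBSETS = {
    "codec/economics": {"queries": 14, "qrels": 1970},
    "codec/history": {"queries": 14, "qrels": 2024},
    "codec/politics": {"queries": 14, "qrels": 2192},
}


def totals(subsets):
    """Sum the query and qrels counts over all subsets."""
    return {key: sum(s[key] for s in subsets.values()) for key in ("queries", "qrels")}


subset_totals = totals(SUBSETS)
print(subset_totals, subset_totals == FULL)
```

All three subsets report the same document count as the full dataset (729824), which agrees with the note that they inherit docs from codec.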