← home
Github: datasets/trec_cast.py

ir_datasets: TREC CAsT (Conversational Assistance)

Index
  1. trec-cast
  2. trec-cast/v0
  3. trec-cast/v0/train
  4. trec-cast/v0/train/judged
  5. trec-cast/v1
  6. trec-cast/v1/2019
  7. trec-cast/v1/2019/judged
  8. trec-cast/v1/2020
  9. trec-cast/v1/2020/judged

Data Access Information

To use version 0 of the corpus, you need a copy of the Washington Post Collection, provided by NIST.

Your organization may already have a copy. If this is the case, you may only need to complete a new "Individual Argeement". Otherwise, your organization will need to file the "Organizational agreement" with NIST. It can take some time to process, but you will end up with a password-protected download link.

For the v0 corpus, the source file required is WashingtonPost.v2.tar.gz. ir_datasets expects the above file to be copied/linked under ~/.ir_datasets/wapo/WashingtonPost.v2.tar.gz.


"trec-cast"

The TREC Conversational Assistance Track (CAsT) is a benchmark for Conversational Information Seeking (CIS) models.

  • Documents: Passages from Wikipedia (TREC CAR or KILT), MS MARCO, and/or the Washington Post (depending on year)
  • Queries: raw utterences in sequence, manual/automatic re-writing of queries (depending on year)
  • Relevance: Deep judgments
  • Track Website

"trec-cast/v0"

Version 0 of the TREC CAsT corpus. This version uses documents from the Washington Post (version 2), TREC CAR (version 2), and MS MARCO passage (version 1).

This corpus was originally meant to be used for evaluation of the 2019 task, but the Washington Post corpus was not included for scoring in the final version due to "an error in the process led to ambiguous document ids," and Washington Post documents were removed from participating systems. As such, trec-cast/v1 (which doesn't include the Washington Post) should be used for the 2019 version of the task. However, this version still can be used for the training set (trec-cast/v0/train) or for replicating the original submissions to the track (prior to the removal of Washingotn Post documents).

docs
48M docs

Language: en

Document type:
GenericDoc: (namedtuple)
  1. doc_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v0")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v0 docs
[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0')
# Index trec-cast/v0
indexer = pt.IterDictIndexer('./indices/trec-cast_v0', meta={"docno": 46})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-cast.v0')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore

Citation

ir_datasets.bib:

\cite{Dalton2019Cast}

Bibtex:

@inproceedings{Dalton2019Cast, title={CAsT 2019: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2019} }
Metadata

"trec-cast/v0/train"

Training set provided by TREC CAsT 2019.

queries
269 queries

Language: en

Query type:
Cast2019Query: (namedtuple)
  1. query_id: str
  2. raw_utterance: str
  3. topic_number: int
  4. turn_number: int
  5. topic_title: str
  6. topic_description: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, raw_utterance, topic_number, turn_number, topic_title, topic_description>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v0/train queries
[query_id]    [raw_utterance]    [topic_number]    [turn_number]    [topic_title]    [topic_description]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train')
index_ref = pt.IndexRef.of('./indices/trec-cast_v0') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('raw_utterance'))

You can find more details about PyTerrier retrieval here.

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.trec-cast.v0.train.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
48M docs

Inherits docs from trec-cast/v0

Language: en

Document type:
GenericDoc: (namedtuple)
  1. doc_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v0/train docs
[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train')
# Index trec-cast/v0
indexer = pt.IterDictIndexer('./indices/trec-cast_v0', meta={"docno": 46})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-cast.v0.train')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore

qrels
2.4K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.DefinitionCount%
0not relevant1.8K73.3%
1relevant329 13.7%
2very relevant311 13.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v0/train qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train')
index_ref = pt.IndexRef.of('./indices/trec-cast_v0') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('raw_utterance'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.trec-cast.v0.train.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

scoreddocs
269K scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v0/train scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

XPM-IR
import datamaestro # Supposes experimaestro-ir be installed

run = datamaestro.prepare_dataset('irds.trec-cast.v0.train.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun 

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun

Citation

ir_datasets.bib:

\cite{Dalton2019Cast}

Bibtex:

@inproceedings{Dalton2019Cast, title={CAsT 2019: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2019} }
Metadata

"trec-cast/v0/train/judged"

trec-cast/2019/train, but with queries that do not appear in the qrels removed.

queries
120 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v0/train/judged queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train/judged')
index_ref = pt.IndexRef.of('./indices/trec-cast_v0') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.trec-cast.v0.train.judged.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
48M docs

Inherits docs from trec-cast/v0

Language: en

Document type:
GenericDoc: (namedtuple)
  1. doc_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v0/train/judged docs
[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train/judged')
# Index trec-cast/v0
indexer = pt.IterDictIndexer('./indices/trec-cast_v0', meta={"docno": 46})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-cast.v0.train.judged')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore

qrels
2.4K qrels

Inherits qrels from trec-cast/v0/train

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.DefinitionCount%
0not relevant1.8K73.3%
1relevant329 13.7%
2very relevant311 13.0%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v0/train/judged qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train/judged')
index_ref = pt.IndexRef.of('./indices/trec-cast_v0') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.trec-cast.v0.train.judged.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

scoreddocs
120K scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v0/train/judged scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train/judged')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

XPM-IR
import datamaestro # Supposes experimaestro-ir be installed

run = datamaestro.prepare_dataset('irds.trec-cast.v0.train.judged.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun 

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun

Citation

ir_datasets.bib:

\cite{Dalton2019Cast}

Bibtex:

@inproceedings{Dalton2019Cast, title={CAsT 2019: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2019} }
Metadata

"trec-cast/v1"

Version 1 of the TREC CAsT corpus. This version uses documents from the TREC CAR (version 2) and MS MARCO passage (version 1). This version of the corpus was used for TREC CAsT 2019 and 2020.

docs
39M docs

Language: en

Document type:
GenericDoc: (namedtuple)
  1. doc_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v1 docs
[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1')
# Index trec-cast/v1
indexer = pt.IterDictIndexer('./indices/trec-cast_v1', meta={"docno": 44})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-cast.v1')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore

Citation

ir_datasets.bib:

\cite{Dalton2019Cast}

Bibtex:

@inproceedings{Dalton2019Cast, title={CAsT 2019: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2019} }
Metadata

"trec-cast/v1/2019"

Official evaluation set for TREC CAsT 2019.

queries
479 queries

Language: en

Query type:
Cast2019Query: (namedtuple)
  1. query_id: str
  2. raw_utterance: str
  3. topic_number: int
  4. turn_number: int
  5. topic_title: str
  6. topic_description: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, raw_utterance, topic_number, turn_number, topic_title, topic_description>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v1/2019 queries
[query_id]    [raw_utterance]    [topic_number]    [turn_number]    [topic_title]    [topic_description]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('raw_utterance'))

You can find more details about PyTerrier retrieval here.

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.trec-cast.v1.2019.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
39M docs

Inherits docs from trec-cast/v1

Language: en

Document type:
GenericDoc: (namedtuple)
  1. doc_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v1/2019 docs
[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019')
# Index trec-cast/v1
indexer = pt.IterDictIndexer('./indices/trec-cast_v1', meta={"docno": 44})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-cast.v1.2019')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore

qrels
29K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.DefinitionCount%
0Fails to meet. The passage is not relevant to the question. The passage is unrelated to the target query.21K72.3%
1Slightly meets. The passage includes some information about the turn, but does not directly answer it. Users will find some useful information in the passage that may lead to the correct answer, perhaps after additional rounds of conversation (better than nothing).2.9K9.8%
2Moderately meets. The passage answers the turn, but is focused on other information that is unrelated to the question. The passage may contain the answer, but users will need extra effort to pick the correct portion. The passage may be relevant, but it may only partially answer the turn, missing a small aspect of the context.2.2K7.3%
3Highly meets. The passage answers the question and is focused on the turn. It would be a satisfactory answer if Google Assistant or Alexa returned this passage in response to the query. It may contain limited extraneous information.1.5K5.0%
4Fully meets. The passage is a perfect answer for the turn. It includes all of the information needed to fully answer the turn in the conversation context. It focuses only on the subject and contains little extra information.1.6K5.5%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v1/2019 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('raw_utterance'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.trec-cast.v1.2019.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

scoreddocs
479K scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v1/2019 scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

XPM-IR
import datamaestro # Supposes experimaestro-ir be installed

run = datamaestro.prepare_dataset('irds.trec-cast.v1.2019.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun 

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun

Citation

ir_datasets.bib:

\cite{Dalton2019Cast}

Bibtex:

@inproceedings{Dalton2019Cast, title={CAsT 2019: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2019} }
Metadata

"trec-cast/v1/2019/judged"

trec-cast/v1/2019, but with queries that do not appear in the qrels removed.

queries
173 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v1/2019/judged queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019/judged')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.trec-cast.v1.2019.judged.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
39M docs

Inherits docs from trec-cast/v1

Language: en

Document type:
GenericDoc: (namedtuple)
  1. doc_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v1/2019/judged docs
[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019/judged')
# Index trec-cast/v1
indexer = pt.IterDictIndexer('./indices/trec-cast_v1', meta={"docno": 44})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-cast.v1.2019.judged')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore

qrels
29K qrels

Inherits qrels from trec-cast/v1/2019

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.DefinitionCount%
0Fails to meet. The passage is not relevant to the question. The passage is unrelated to the target query.21K72.3%
1Slightly meets. The passage includes some information about the turn, but does not directly answer it. Users will find some useful information in the passage that may lead to the correct answer, perhaps after additional rounds of conversation (better than nothing).2.9K9.8%
2Moderately meets. The passage answers the turn, but is focused on other information that is unrelated to the question. The passage may contain the answer, but users will need extra effort to pick the correct portion. The passage may be relevant, but it may only partially answer the turn, missing a small aspect of the context.2.2K7.3%
3Highly meets. The passage answers the question and is focused on the turn. It would be a satisfactory answer if Google Assistant or Alexa returned this passage in response to the query. It may contain limited extraneous information.1.5K5.0%
4Fully meets. The passage is a perfect answer for the turn. It includes all of the information needed to fully answer the turn in the conversation context. It focuses only on the subject and contains little extra information.1.6K5.5%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v1/2019/judged qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019/judged')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.trec-cast.v1.2019.judged.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

scoreddocs
173K scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v1/2019/judged scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019/judged')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

XPM-IR
import datamaestro # Supposes experimaestro-ir be installed

run = datamaestro.prepare_dataset('irds.trec-cast.v1.2019.judged.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun 

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun

Citation

ir_datasets.bib:

\cite{Dalton2019Cast}

Bibtex:

@inproceedings{Dalton2019Cast, title={CAsT 2019: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2019} }
Metadata

"trec-cast/v1/2020"

Official evaluation set for TREC CAsT 2020.

queries
216 queries

Language: en

Query type:
Cast2020Query: (namedtuple)
  1. query_id: str
  2. raw_utterance: str
  3. automatic_rewritten_utterance: str
  4. manual_rewritten_utterance: str
  5. manual_canonical_result_id: str
  6. topic_number: int
  7. turn_number: int

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, raw_utterance, automatic_rewritten_utterance, manual_rewritten_utterance, manual_canonical_result_id, topic_number, turn_number>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v1/2020 queries
[query_id]    [raw_utterance]    [automatic_rewritten_utterance]    [manual_rewritten_utterance]    [manual_canonical_result_id]    [topic_number]    [turn_number]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2020')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('raw_utterance'))

You can find more details about PyTerrier retrieval here.

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.trec-cast.v1.2020.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
39M docs

Inherits docs from trec-cast/v1

Language: en

Document type:
GenericDoc: (namedtuple)
  1. doc_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2020")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v1/2020 docs
[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2020')
# Index trec-cast/v1
indexer = pt.IterDictIndexer('./indices/trec-cast_v1', meta={"docno": 44})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-cast.v1.2020')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore

qrels
40K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.DefinitionCount%
0Fails to meet. The passage is not relevant to the question. The passage is unrelated to the target query.34K83.5%
1Slightly meets. The passage includes some information about the turn, but does not directly answer it. Users will find some useful information in the passage that may lead to the correct answer, perhaps after additional rounds of conversation (better than nothing).2.7K6.7%
2Moderately meets. The passage answers the turn, but is focused on other information that is unrelated to the question. The passage may contain the answer, but users will need extra effort to pick the correct portion. The passage may be relevant, but it may only partially answer the turn, missing a small aspect of the context.1.8K4.5%
3Highly meets. The passage answers the question and is focused on the turn. It would be a satisfactory answer if Google Assistant or Alexa returned this passage in response to the query. It may contain limited extraneous information.1.4K3.5%
4Fully meets. The passage is a perfect answer for the turn. It includes all of the information needed to fully answer the turn in the conversation context. It focuses only on the subject and contains little extra information.731 1.8%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2020")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v1/2020 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2020')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('raw_utterance'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.trec-cast.v1.2020.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Dalton2020Cast}

Bibtex:

@inproceedings{Dalton2020Cast, title={CAsT 2020: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2020} }
Metadata

"trec-cast/v1/2020/judged"

trec-cast/v1/2020, but with queries that do not appear in the qrels removed.

queries
208 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2020/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v1/2020/judged queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2020/judged')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.trec-cast.v1.2020.judged.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
39M docs

Inherits docs from trec-cast/v1

Language: en

Document type:
GenericDoc: (namedtuple)
  1. doc_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2020/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v1/2020/judged docs
[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2020/judged')
# Index trec-cast/v1
indexer = pt.IterDictIndexer('./indices/trec-cast_v1', meta={"docno": 44})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-cast.v1.2020.judged')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore

qrels
40K qrels

Inherits qrels from trec-cast/v1/2020

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.DefinitionCount%
0Fails to meet. The passage is not relevant to the question. The passage is unrelated to the target query.34K83.5%
1Slightly meets. The passage includes some information about the turn, but does not directly answer it. Users will find some useful information in the passage that may lead to the correct answer, perhaps after additional rounds of conversation (better than nothing).2.7K6.7%
2Moderately meets. The passage answers the turn, but is focused on other information that is unrelated to the question. The passage may contain the answer, but users will need extra effort to pick the correct portion. The passage may be relevant, but it may only partially answer the turn, missing a small aspect of the context.1.8K4.5%
3Highly meets. The passage answers the question and is focused on the turn. It would be a satisfactory answer if Google Assistant or Alexa returned this passage in response to the query. It may contain limited extraneous information.1.4K3.5%
4Fully meets. The passage is a perfect answer for the turn. It includes all of the information needed to fully answer the turn in the conversation context. It focuses only on the subject and contains little extra information.731 1.8%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2020/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export trec-cast/v1/2020/judged qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2020/judged')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.trec-cast.v1.2020.judged.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This examples requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Dalton2020Cast}

Bibtex:

@inproceedings{Dalton2020Cast, title={CAsT 2020: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2020} }
Metadata