Github: datasets/trec_cast.py

ir_datasets: TREC CAsT (Conversational Assistance)

Index
  1. trec-cast
  2. trec-cast/v0
  3. trec-cast/v0/train
  4. trec-cast/v0/train/judged
  5. trec-cast/v1
  6. trec-cast/v1/2019
  7. trec-cast/v1/2019/judged
  8. trec-cast/v1/2020
  9. trec-cast/v1/2020/judged

Data Access Information

To use version 0 of the corpus, you need a copy of the Washington Post Collection, provided by NIST.

Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" with NIST. Processing can take some time, but you will end up with a password-protected download link.

For the v0 corpus, the required source file is WashingtonPost.v2.tar.gz. ir_datasets expects this file to be copied or linked to ~/.ir_datasets/wapo/WashingtonPost.v2.tar.gz.
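
The placement described above can be scripted. The following is a minimal sketch, assuming you already hold a licensed copy of the archive somewhere on disk; the helper names (expected_wapo_path, link_wapo_archive) and the source path in the usage note are hypothetical and not part of ir_datasets.

```python
from pathlib import Path

# File name and relative location are taken from the text above.
WAPO_RELATIVE_PATH = Path("wapo") / "WashingtonPost.v2.tar.gz"


def expected_wapo_path(home: Path) -> Path:
    """Return where the archive is expected, given the ir_datasets home directory."""
    return home / WAPO_RELATIVE_PATH


def link_wapo_archive(source: Path, home: Path) -> Path:
    """Symlink an already-downloaded archive into the expected location.

    `source` is a hypothetical path to your own copy of the archive.
    """
    source = source.resolve()
    if not source.is_file():
        raise FileNotFoundError(f"archive not found: {source}")
    target = expected_wapo_path(home)
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.is_symlink() and not target.exists():
        target.unlink()  # replace a dangling link left by an earlier attempt
    if not target.exists():
        target.symlink_to(source)
    return target
```

For the default location named above, a call might look like link_wapo_archive(Path("/data/WashingtonPost.v2.tar.gz"), Path.home() / ".ir_datasets"), where /data/... stands in for wherever your copy lives.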


"trec-cast"

The TREC Conversational Assistance Track (CAsT) is a benchmark for Conversational Information Seeking (CIS) models.

  • Documents: Passages from Wikipedia (TREC CAR or KILT), MS MARCO, and/or the Washington Post (depending on year)
  • Queries: raw utterances in sequence, manual/automatic rewrites of queries (depending on year)
  • Relevance: Deep judgments
  • Track Website

"trec-cast/v0"

Version 0 of the TREC CAsT corpus. This version uses documents from the Washington Post (version 2), TREC CAR (version 2), and MS MARCO passage (version 1).

This corpus was originally meant to be used for evaluation of the 2019 task, but the Washington Post corpus was excluded from scoring in the final version because "an error in the process led to ambiguous document ids," and Washington Post documents were removed from the participating systems' submissions. As such, trec-cast/v1 (which does not include the Washington Post) should be used for the 2019 version of the task. However, this version can still be used for the training set (trec-cast/v0/train) or for replicating the original submissions to the track (prior to the removal of Washington Post documents).
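
To approximate the post-removal setting from a ranking produced over v0, one option is to drop Washington Post passages from the ranked list. This is only a sketch: it assumes Washington Post doc_ids can be recognized by a prefix, and the "WAPO_" value below is an assumption to verify against the doc_ids in your copy of the corpus. The function name drop_wapo is hypothetical.

```python
from typing import Iterable, List, Tuple

# ASSUMPTION: Washington Post passages share a recognizable doc_id prefix.
# "WAPO_" is illustrative; inspect dataset.docs_iter() output to confirm.
WAPO_PREFIX = "WAPO_"


def drop_wapo(
    ranking: Iterable[Tuple[str, float]], prefix: str = WAPO_PREFIX
) -> List[Tuple[str, float]]:
    """Return the ranking without documents whose doc_id starts with `prefix`.

    The relative order of the remaining (doc_id, score) pairs is preserved.
    """
    return [(doc_id, score) for doc_id, score in ranking if not doc_id.startswith(prefix)]
```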

docs | Citation | Metadata
48M docs

Language: en

Document type:
GenericDoc: (namedtuple)
  1. doc_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v0")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.


"trec-cast/v0/train"

Training set provided by TREC CAsT 2019.

queries | docs | qrels | scoreddocs | Citation | Metadata
269 queries

Language: en

Query type:
Cast2019Query: (namedtuple)
  1. query_id: str
  2. raw_utterance: str
  3. topic_number: int
  4. turn_number: int
  5. topic_title: str
  6. topic_description: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, raw_utterance, topic_number, turn_number, topic_title, topic_description>
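
Because each query is one turn of a conversation, it is often useful to regroup turns by topic_number and order them by turn_number. The sketch below does this for any iterable of objects carrying the fields listed above. Turn is a hypothetical stand-in used only so the example runs without downloading the data; group_by_topic and history_before are illustrative names, not ir_datasets functions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Turn(NamedTuple):
    """Hypothetical stand-in with a subset of the Cast2019Query fields."""
    query_id: str
    raw_utterance: str
    topic_number: int
    turn_number: int


def group_by_topic(queries: Iterable) -> Dict[int, List]:
    """Map topic_number to that topic's turns, sorted by turn_number."""
    conversations = defaultdict(list)
    for q in queries:
        conversations[q.topic_number].append(q)
    for turns in conversations.values():
        turns.sort(key=lambda q: q.turn_number)
    return dict(conversations)


def history_before(turns: List, turn_number: int) -> List[str]:
    """Raw utterances of all turns that precede `turn_number` in one topic."""
    return [q.raw_utterance for q in turns if q.turn_number < turn_number]
```

With the real data, group_by_topic(dataset.queries_iter()) should work unchanged, since only the topic_number, turn_number, and raw_utterance attributes are accessed.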



"trec-cast/v0/train/judged"

trec-cast/v0/train, but with queries that do not appear in the qrels removed.

queries | docs | qrels | scoreddocs | Citation | Metadata
120 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
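
The idea behind a "judged" subset can be sketched as a filter: keep only the queries whose query_id occurs in at least one qrel. This illustrates the description above and is not necessarily how ir_datasets builds the subset internally. Query and Qrel are hypothetical stand-ins so the example is self-contained, and judged_queries is an illustrative name.

```python
from typing import Iterable, List, NamedTuple


class Query(NamedTuple):
    """Hypothetical stand-in mirroring the GenericQuery fields."""
    query_id: str
    text: str


class Qrel(NamedTuple):
    """Hypothetical stand-in for a relevance judgment record."""
    query_id: str
    doc_id: str
    relevance: int


def judged_queries(queries: Iterable, qrels: Iterable) -> List:
    """Keep queries that have at least one judgment, preserving their order."""
    judged_ids = {qrel.query_id for qrel in qrels}
    return [q for q in queries if q.query_id in judged_ids]
```

With real data, the two arguments would typically come from dataset.queries_iter() and dataset.qrels_iter().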



"trec-cast/v1"

Version 1 of the TREC CAsT corpus. This version uses documents from TREC CAR (version 2) and MS MARCO passage (version 1). It was used for TREC CAsT 2019 and 2020.

docs | Citation | Metadata
39M docs

Language: en

Document type:
GenericDoc: (namedtuple)
  1. doc_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>



"trec-cast/v1/2019"

Official evaluation set for TREC CAsT 2019.

queries | docs | qrels | scoreddocs | Citation | Metadata
479 queries

Language: en

Query type:
Cast2019Query: (namedtuple)
  1. query_id: str
  2. raw_utterance: str
  3. topic_number: int
  4. turn_number: int
  5. topic_title: str
  6. topic_description: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, raw_utterance, topic_number, turn_number, topic_title, topic_description>



"trec-cast/v1/2019/judged"

trec-cast/v1/2019, but with queries that do not appear in the qrels removed.

queries | docs | qrels | scoreddocs | Citation | Metadata
173 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"trec-cast/v1/2020"

Official evaluation set for TREC CAsT 2020.

queries | docs | qrels | Citation | Metadata
216 queries

Language: en

Query type:
Cast2020Query: (namedtuple)
  1. query_id: str
  2. raw_utterance: str
  3. automatic_rewritten_utterance: str
  4. manual_rewritten_utterance: str
  5. manual_canonical_result_id: str
  6. topic_number: int
  7. turn_number: int

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, raw_utterance, automatic_rewritten_utterance, manual_rewritten_utterance, manual_canonical_result_id, topic_number, turn_number>
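
Since a 2020 query carries three utterance variants, an experiment usually has to pick one of them as the retrieval text. The sketch below shows one way to make that choice explicit. Turn2020 is a hypothetical stand-in with a subset of the Cast2020Query fields, and utterance_for and its mode names ("raw", "automatic", "manual") are illustrative rather than part of ir_datasets.

```python
from typing import NamedTuple


class Turn2020(NamedTuple):
    """Hypothetical stand-in with a subset of the Cast2020Query fields."""
    query_id: str
    raw_utterance: str
    automatic_rewritten_utterance: str
    manual_rewritten_utterance: str


# Illustrative mode names mapped to the field names listed above.
_FIELD_BY_MODE = {
    "raw": "raw_utterance",
    "automatic": "automatic_rewritten_utterance",
    "manual": "manual_rewritten_utterance",
}


def utterance_for(query, mode: str = "raw") -> str:
    """Return the utterance variant selected by `mode`."""
    try:
        field = _FIELD_BY_MODE[mode]
    except KeyError:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(_FIELD_BY_MODE)}") from None
    return getattr(query, field)
```

With real data, something like [utterance_for(q, "manual") for q in dataset.queries_iter()] would collect the manually rewritten utterances in iteration order.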



"trec-cast/v1/2020/judged"

trec-cast/v1/2020, but with queries that do not appear in the qrels removed.

queries | docs | qrels | Citation | Metadata
208 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2020/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
