ir_datasets: TREC CAsT (Conversational Assistance)

import ir_datasets
dataset = ir_datasets.load("trec-cast/v0")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export trec-cast/v0 docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0')
# Index trec-cast/v0
indexer = pt.IterDictIndexer('./indices/trec-cast_v0', meta={"docno": 46})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-cast.v0')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

\cite{Dalton2019Cast}

Bibtex:

@inproceedings{Dalton2019Cast, title={CAsT 2019: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2019} }

{
  "docs": {
    "count": 47696605,
    "fields": {
      "doc_id": {
        "max_len": 46,
        "common_prefix": ""
      }
    }
  }
}
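The `max_len` recorded for `doc_id` in this metadata equals the `docno` length passed to `pt.IterDictIndexer` in the indexing example (46 for this corpus). As a minimal sketch, assuming the metadata is available as a JSON string shaped like the block above, the value could be read from the metadata instead of being hard-coded. `docno_meta_length` is a hypothetical helper for illustration, not part of ir_datasets or PyTerrier.

```python
import json

# Metadata copied from the block above.
metadata = json.loads("""
{
  "docs": {
    "count": 47696605,
    "fields": {"doc_id": {"max_len": 46, "common_prefix": ""}}
  }
}
""")


def docno_meta_length(meta: dict) -> int:
    """Return the longest doc_id length recorded in the metadata (hypothetical helper)."""
    return meta["docs"]["fields"]["doc_id"]["max_len"]


# A dict in the same shape as the `meta=` argument used in the indexing example.
indexer_meta = {"docno": docno_meta_length(metadata)}
print(indexer_meta)
```

Reserving exactly `max_len` characters keeps the longest document identifier from being truncated in the index metadata.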

`"trec-cast/v0/train"`

Training set provided by TREC CAsT 2019.

269 queries

Language: en

Query type:

Cast2019Query: (namedtuple)

query_id: str
raw_utterance: str
topic_number: int
turn_number: int
topic_title: str
topic_description: str
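Because each query carries a `topic_number` and a `turn_number`, the turns of one conversation can be put back in order. The following is a minimal sketch under stated assumptions: `Cast2019Query` is redefined locally as a stand-in with the fields listed above, `group_into_conversations` is a hypothetical helper, and the sample records are invented for illustration.

```python
from collections import defaultdict, namedtuple

# Local stand-in mirroring the Cast2019Query fields listed above.
Cast2019Query = namedtuple(
    "Cast2019Query",
    ["query_id", "raw_utterance", "topic_number", "turn_number",
     "topic_title", "topic_description"],
)


def group_into_conversations(queries):
    """Map each topic_number to its queries, ordered by turn_number."""
    conversations = defaultdict(list)
    for query in queries:
        conversations[query.topic_number].append(query)
    return {
        topic: sorted(turns, key=lambda q: q.turn_number)
        for topic, turns in conversations.items()
    }


# Invented sample records, for illustration only.
sample = [
    Cast2019Query("q-b", "What about its habitat?", 1, 2, "Example topic", ""),
    Cast2019Query("q-a", "Tell me about an animal.", 1, 1, "Example topic", ""),
    Cast2019Query("q-c", "How do I bake bread?", 2, 1, "Another topic", ""),
]
conversations = group_into_conversations(sample)
for topic, turns in conversations.items():
    print(topic, [q.raw_utterance for q in turns])
```

With the real dataset, the same helper should accept the output of `dataset.queries_iter()`, since it only reads the `topic_number` and `turn_number` attributes.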

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, raw_utterance, topic_number, turn_number, topic_title, topic_description>

You can find more details about the Python API here.

CLI

ir_datasets export trec-cast/v0/train queries



[query_id]    [raw_utterance]    [topic_number]    [turn_number]    [topic_title]    [topic_description]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train')
index_ref = pt.IndexRef.of('./indices/trec-cast_v0') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('raw_utterance'))

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.trec-cast.v0.train.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

48M docs

Inherits docs from trec-cast/v0

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export trec-cast/v0/train docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train')
# Index trec-cast/v0
indexer = pt.IterDictIndexer('./indices/trec-cast_v0', meta={"docno": 46})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-cast.v0.train')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

2.4K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	not relevant	`1.8K`	73.3%
1	relevant	`329`	13.7%
2	very relevant	`311`	13.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export trec-cast/v0/train qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train')
index_ref = pt.IndexRef.of('./indices/trec-cast_v0') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('raw_utterance'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.trec-cast.v0.train.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

269K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float
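Scored documents arrive as flat `(query_id, doc_id, score)` records; the counts on this page (269K scoreddocs for 269 queries) work out to 1,000 per query on average. A common first step is to regroup them per query and keep the best-scoring entries. This is a minimal sketch, assuming a locally defined stand-in for `GenericScoredDoc`; `top_k_per_query` is a hypothetical helper and the sample scores are invented.

```python
from collections import defaultdict, namedtuple

# Local stand-in mirroring the GenericScoredDoc fields listed above.
GenericScoredDoc = namedtuple("GenericScoredDoc", ["query_id", "doc_id", "score"])


def top_k_per_query(scoreddocs, k):
    """Group scored documents by query and keep the k highest-scoring ones."""
    by_query = defaultdict(list)
    for scoreddoc in scoreddocs:
        by_query[scoreddoc.query_id].append(scoreddoc)
    return {
        query_id: sorted(docs, key=lambda sd: sd.score, reverse=True)[:k]
        for query_id, docs in by_query.items()
    }


# Invented sample records, for illustration only.
sample = [
    GenericScoredDoc("q1", "d1", 1.5),
    GenericScoredDoc("q1", "d2", 3.0),
    GenericScoredDoc("q1", "d3", 2.0),
    GenericScoredDoc("q2", "d1", 0.7),
]
top2 = top_k_per_query(sample, k=2)
for query_id, docs in top2.items():
    print(query_id, [(sd.doc_id, sd.score) for sd in docs])
```

The helper holds every record in memory, which may be acceptable for a few hundred thousand rows but is a design choice worth revisiting for larger runs.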

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export trec-cast/v0/train scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

import datamaestro  # assumes experimaestro-ir is installed

run = datamaestro.prepare_dataset('irds.trec-cast.v0.train.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.

\cite{Dalton2019Cast}

Bibtex:

@inproceedings{Dalton2019Cast, title={CAsT 2019: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2019} }

{
  "docs": {
    "count": 47696605,
    "fields": {
      "doc_id": {
        "max_len": 46,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 269
  },
  "qrels": {
    "count": 2399,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 311,
          "0": 1759,
          "1": 329
        }
      }
    }
  },
  "scoreddocs": {
    "count": 269000
  }
}
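The percentages in the relevance-level table follow directly from `counts_by_value` in this metadata. A short worked example, using only the numbers shown above (`relevance_shares` is a hypothetical helper name):

```python
# Counts copied from the "qrels" entry of the metadata above.
counts_by_value = {"2": 311, "0": 1759, "1": 329}


def relevance_shares(counts):
    """Return {level: percentage of all qrels}, rounded to one decimal place."""
    total = sum(counts.values())
    return {
        int(level): round(100 * count / total, 1)
        for level, count in sorted(counts.items(), key=lambda kv: int(kv[0]))
    }


total_qrels = sum(counts_by_value.values())
shares = relevance_shares(counts_by_value)
print(total_qrels, shares)
```

The three counts sum to the reported 2399 qrels, and the rounded shares agree with the 73.3%, 13.7%, and 13.0% listed in the table.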

`"trec-cast/v0/train/judged"`

trec-cast/v0/train, but with queries that do not appear in the qrels removed.
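The filtering described above can be sketched in a few lines: collect the query IDs that occur in the qrels and keep only those queries. This is an illustrative reconstruction, not the library's implementation; `Query` and `Qrel` are simplified local stand-ins, `judged_only` is a hypothetical helper, and the records are invented.

```python
from collections import namedtuple

# Simplified local stand-ins; the real query type has more fields.
Query = namedtuple("Query", ["query_id", "raw_utterance"])
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def judged_only(queries, qrels):
    """Keep only the queries whose query_id appears in at least one qrel."""
    judged_ids = {qrel.query_id for qrel in qrels}
    return [query for query in queries if query.query_id in judged_ids]


# Invented sample records, for illustration only.
queries = [Query("q1", "first"), Query("q2", "second"), Query("q3", "third")]
qrels = [Qrel("q1", "d1", 2, "0"), Qrel("q3", "d7", 0, "0"), Qrel("q3", "d9", 1, "0")]

judged = judged_only(queries, qrels)
print([query.query_id for query in judged])
```

Note that a query counts as judged even when all of its judgments are non-relevant, as with `q3`'s first record here.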

120 queries

Language: en

Query type:

Cast2019Query: (namedtuple)

query_id: str
raw_utterance: str
topic_number: int
turn_number: int
topic_title: str
topic_description: str

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, raw_utterance, topic_number, turn_number, topic_title, topic_description>

You can find more details about the Python API here.

CLI

ir_datasets export trec-cast/v0/train/judged queries



[query_id]    [raw_utterance]    [topic_number]    [turn_number]    [topic_title]    [topic_description]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train/judged')
index_ref = pt.IndexRef.of('./indices/trec-cast_v0') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('raw_utterance'))

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.trec-cast.v0.train.judged.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

48M docs

Inherits docs from trec-cast/v0

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export trec-cast/v0/train/judged docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train/judged')
# Index trec-cast/v0
indexer = pt.IterDictIndexer('./indices/trec-cast_v0', meta={"docno": 46})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-cast.v0.train.judged')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

2.4K qrels

Inherits qrels from trec-cast/v0/train

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	not relevant	`1.8K`	73.3%
1	relevant	`329`	13.7%
2	very relevant	`311`	13.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export trec-cast/v0/train/judged qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train/judged')
index_ref = pt.IndexRef.of('./indices/trec-cast_v0') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('raw_utterance'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.trec-cast.v0.train.judged.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

120K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export trec-cast/v0/train/judged scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train/judged')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

import datamaestro  # assumes experimaestro-ir is installed

run = datamaestro.prepare_dataset('irds.trec-cast.v0.train.judged.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.

\cite{Dalton2019Cast}

Bibtex:

@inproceedings{Dalton2019Cast, title={CAsT 2019: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2019} }

{
  "docs": {
    "count": 47696605,
    "fields": {
      "doc_id": {
        "max_len": 46,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 120
  },
  "qrels": {
    "count": 2399,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 311,
          "0": 1759,
          "1": 329
        }
      }
    }
  },
  "scoreddocs": {
    "count": 120000
  }
}

`"trec-cast/v1"`

Version 1 of the TREC CAsT corpus. This version uses documents from TREC CAR (version 2) and MS MARCO passage (version 1). It was used for TREC CAsT 2019 and 2020.

Task Overview Paper

docs

39M docs

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export trec-cast/v1 docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1')
# Index trec-cast/v1
indexer = pt.IterDictIndexer('./indices/trec-cast_v1', meta={"docno": 44})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-cast.v1')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

\cite{Dalton2019Cast}

Bibtex:

@inproceedings{Dalton2019Cast, title={CAsT 2019: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2019} }

{
  "docs": {
    "count": 38622444,
    "fields": {
      "doc_id": {
        "max_len": 44,
        "common_prefix": ""
      }
    }
  }
}

`"trec-cast/v1/2019"`

Official evaluation set for TREC CAsT 2019.

479 queries

Language: en

Query type:

Cast2019Query: (namedtuple)

query_id: str
raw_utterance: str
topic_number: int
turn_number: int
topic_title: str
topic_description: str

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, raw_utterance, topic_number, turn_number, topic_title, topic_description>

You can find more details about the Python API here.

CLI

ir_datasets export trec-cast/v1/2019 queries



[query_id]    [raw_utterance]    [topic_number]    [turn_number]    [topic_title]    [topic_description]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('raw_utterance'))

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.trec-cast.v1.2019.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

39M docs

Inherits docs from trec-cast/v1

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export trec-cast/v1/2019 docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019')
# Index trec-cast/v1
indexer = pt.IterDictIndexer('./indices/trec-cast_v1', meta={"docno": 44})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-cast.v1.2019')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

29K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Fails to meet. The passage is not relevant to the question. The passage is unrelated to the target query.	`21K`	72.3%
1	Slightly meets. The passage includes some information about the turn, but does not directly answer it. Users will find some useful information in the passage that may lead to the correct answer, perhaps after additional rounds of conversation (better than nothing).	`2.9K`	9.8%
2	Moderately meets. The passage answers the turn, but is focused on other information that is unrelated to the question. The passage may contain the answer, but users will need extra effort to pick the correct portion. The passage may be relevant, but it may only partially answer the turn, missing a small aspect of the context.	`2.2K`	7.3%
3	Highly meets. The passage answers the question and is focused on the turn. It would be a satisfactory answer if Google Assistant or Alexa returned this passage in response to the query. It may contain limited extraneous information.	`1.5K`	5.0%
4	Fully meets. The passage is a perfect answer for the turn. It includes all of the information needed to fully answer the turn in the conversation context. It focuses only on the subject and contains little extra information.	`1.6K`	5.5%
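Some evaluation measures expect binary judgments, so the five-point scale above is often collapsed with a cutoff. The sketch below shows the mechanics only: the threshold is a parameter the caller must choose, and the value 2 used in the sample is an arbitrary illustration, not a cutoff prescribed by this page. `TrecQrel` is a local stand-in, `binarize` is a hypothetical helper, and the records are invented.

```python
from collections import namedtuple

# Local stand-in mirroring the TrecQrel fields listed above.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def binarize(qrels, min_relevance):
    """Map (query_id, doc_id) to 1 if relevance >= min_relevance, else 0."""
    return {
        (qrel.query_id, qrel.doc_id): int(qrel.relevance >= min_relevance)
        for qrel in qrels
    }


# Invented sample records, for illustration only.
sample = [
    TrecQrel("q1", "d1", 0, "0"),
    TrecQrel("q1", "d2", 1, "0"),
    TrecQrel("q1", "d3", 2, "0"),
    TrecQrel("q2", "d4", 4, "0"),
]
binary = binarize(sample, min_relevance=2)  # threshold chosen for illustration
print(binary)
```

Raising `min_relevance` to 3 or 4 would count only "Highly meets" or "Fully meets" passages as relevant, which changes the measured scores accordingly.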

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export trec-cast/v1/2019 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('raw_utterance'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.trec-cast.v1.2019.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

479K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export trec-cast/v1/2019 scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

import datamaestro  # assumes experimaestro-ir is installed

run = datamaestro.prepare_dataset('irds.trec-cast.v1.2019.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.

\cite{Dalton2019Cast}

Bibtex:

@inproceedings{Dalton2019Cast, title={CAsT 2019: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2019} }

{
  "docs": {
    "count": 38622444,
    "fields": {
      "doc_id": {
        "max_len": 44,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 479
  },
  "qrels": {
    "count": 29350,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 21230,
          "1": 2889,
          "2": 2157,
          "3": 1456,
          "4": 1618
        }
      }
    }
  },
  "scoreddocs": {
    "count": 479000
  }
}

`"trec-cast/v1/2019/judged"`

trec-cast/v1/2019, but with queries that do not appear in the qrels removed.

173 queries

Language: en

Query type:

Cast2019Query: (namedtuple)

query_id: str
raw_utterance: str
topic_number: int
turn_number: int
topic_title: str
topic_description: str

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, raw_utterance, topic_number, turn_number, topic_title, topic_description>

You can find more details about the Python API here.

CLI

ir_datasets export trec-cast/v1/2019/judged queries



[query_id]    [raw_utterance]    [topic_number]    [turn_number]    [topic_title]    [topic_description]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019/judged')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('raw_utterance'))

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.trec-cast.v1.2019.judged.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

39M docs

Inherits docs from trec-cast/v1

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export trec-cast/v1/2019/judged docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019/judged')
# Index trec-cast/v1
indexer = pt.IterDictIndexer('./indices/trec-cast_v1', meta={"docno": 44})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-cast.v1.2019.judged')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

29K qrels

Inherits qrels from trec-cast/v1/2019

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Fails to meet. The passage is not relevant to the question. The passage is unrelated to the target query.	`21K`	72.3%
1	Slightly meets. The passage includes some information about the turn, but does not directly answer it. Users will find some useful information in the passage that may lead to the correct answer, perhaps after additional rounds of conversation (better than nothing).	`2.9K`	9.8%
2	Moderately meets. The passage answers the turn, but is focused on other information that is unrelated to the question. The passage may contain the answer, but users will need extra effort to pick the correct portion. The passage may be relevant, but it may only partially answer the turn, missing a small aspect of the context.	`2.2K`	7.3%
3	Highly meets. The passage answers the question and is focused on the turn. It would be a satisfactory answer if Google Assistant or Alexa returned this passage in response to the query. It may contain limited extraneous information.	`1.5K`	5.0%
4	Fully meets. The passage is a perfect answer for the turn. It includes all of the information needed to fully answer the turn in the conversation context. It focuses only on the subject and contains little extra information.	`1.6K`	5.5%

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export trec-cast/v1/2019/judged qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019/judged')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('raw_utterance'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.trec-cast.v1.2019.judged.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

173K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export trec-cast/v1/2019/judged scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019/judged')
dataset.get_results()

You can find more details about PyTerrier dataset API here.

import datamaestro  # assumes experimaestro-ir is installed

run = datamaestro.prepare_dataset('irds.trec-cast.v1.2019.judged.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.

\cite{Dalton2019Cast}

Bibtex:

@inproceedings{Dalton2019Cast, title={CAsT 2019: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2019} }

{
  "docs": {
    "count": 38622444,
    "fields": {
      "doc_id": {
        "max_len": 44,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 173
  },
  "qrels": {
    "count": 29350,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 21230,
          "1": 2889,
          "2": 2157,
          "3": 1456,
          "4": 1618
        }
      }
    }
  },
  "scoreddocs": {
    "count": 173000
  }
}

`"trec-cast/v1/2020"`

Official evaluation set for TREC CAsT 2020.

Task Overview Paper

216 queries

Language: en

Query type:

Cast2020Query: (namedtuple)

query_id: str
raw_utterance: str
automatic_rewritten_utterance: str
manual_rewritten_utterance: str
manual_canonical_result_id: str
topic_number: int
turn_number: int
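The 2020 query type exposes three utterance variants (raw, automatically rewritten, and manually rewritten), so downstream code has to decide which text to retrieve with. A minimal sketch, assuming a locally defined stand-in for `Cast2020Query`; `query_text` is a hypothetical helper and the sample record is invented.

```python
from collections import namedtuple

# Local stand-in mirroring the Cast2020Query fields listed above.
Cast2020Query = namedtuple(
    "Cast2020Query",
    ["query_id", "raw_utterance", "automatic_rewritten_utterance",
     "manual_rewritten_utterance", "manual_canonical_result_id",
     "topic_number", "turn_number"],
)

UTTERANCE_FIELDS = (
    "raw_utterance",
    "automatic_rewritten_utterance",
    "manual_rewritten_utterance",
)


def query_text(query, field="raw_utterance"):
    """Return the chosen utterance variant, rejecting non-utterance fields."""
    if field not in UTTERANCE_FIELDS:
        raise ValueError(f"unknown utterance field: {field!r}")
    return getattr(query, field)


# Invented sample record, for illustration only.
sample = Cast2020Query(
    "q-x", "What about its price?", "What about the phone price?",
    "What is the price of the phone?", "doc-0", 1, 2,
)
for field in UTTERANCE_FIELDS:
    print(field, "->", query_text(sample, field))
```

Validating the field name up front avoids silently retrieving with a non-text attribute such as `manual_canonical_result_id`.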

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, raw_utterance, automatic_rewritten_utterance, manual_rewritten_utterance, manual_canonical_result_id, topic_number, turn_number>

You can find more details about the Python API here.

CLI

ir_datasets export trec-cast/v1/2020 queries



[query_id]    [raw_utterance]    [automatic_rewritten_utterance]    [manual_rewritten_utterance]    [manual_canonical_result_id]    [topic_number]    [turn_number]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2020')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('raw_utterance'))

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.trec-cast.v1.2020.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

39M docs

Inherits docs from trec-cast/v1

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2020")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export trec-cast/v1/2020 docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2020')
# Index trec-cast/v1
indexer = pt.IterDictIndexer('./indices/trec-cast_v1', meta={"docno": 44})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-cast.v1.2020')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

40K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Fails to meet. The passage is not relevant to the question. The passage is unrelated to the target query.	`34K`	83.5%
1	Slightly meets. The passage includes some information about the turn, but does not directly answer it. Users will find some useful information in the passage that may lead to the correct answer, perhaps after additional rounds of conversation (better than nothing).	`2.7K`	6.7%
2	Moderately meets. The passage answers the turn, but is focused on other information that is unrelated to the question. The passage may contain the answer, but users will need extra effort to pick the correct portion. The passage may be relevant, but it may only partially answer the turn, missing a small aspect of the context.	`1.8K`	4.5%
3	Highly meets. The passage answers the question and is focused on the turn. It would be a satisfactory answer if Google Assistant or Alexa returned this passage in response to the query. It may contain limited extraneous information.	`1.4K`	3.5%
4	Fully meets. The passage is a perfect answer for the turn. It includes all of the information needed to fully answer the turn in the conversation context. It focuses only on the subject and contains little extra information.	`731`	1.8%

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2020")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
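The iterator above yields one flat record per judgment, while evaluation code often wants the judgments grouped per query. The following is a minimal sketch of that grouping in plain Python. `group_qrels` is a hypothetical helper, `Qrel` is a stand-in that only mirrors the TrecQrel field names, and the sample records are made up; in practice the input would be `dataset.qrels_iter()`.

```python
from collections import defaultdict, namedtuple

# Stand-in with the same field names as TrecQrel (assumption for illustration).
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def group_qrels(qrels):
    """Return {query_id: {doc_id: relevance}} for an iterable of qrel records."""
    grouped = defaultdict(dict)
    for qrel in qrels:
        grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


# Made-up records, not taken from the dataset.
sample_qrels = [
    Qrel("q1", "doc_a", 3, "0"),
    Qrel("q1", "doc_b", 0, "0"),
    Qrel("q2", "doc_a", 1, "0"),
]
print(group_qrels(sample_qrels))
```

Note that this holds every judgment in memory at once, which is fine for the roughly 40K qrels listed here.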

CLI

ir_datasets export trec-cast/v1/2020 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2020')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('raw_utterance'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.trec-cast.v1.2020.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

\cite{Dalton2020Cast}

Bibtex:

@inproceedings{Dalton2020Cast, title={CAsT 2020: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2020} }

{
  "docs": {
    "count": 38622444,
    "fields": {
      "doc_id": {
        "max_len": 44,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 216
  },
  "qrels": {
    "count": 40451,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 2697,
          "0": 33781,
          "2": 1834,
          "3": 1408,
          "4": 731
        }
      }
    }
  }
}
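The percentage column of the relevance-level table follows directly from `counts_by_value` in the metadata above. As a quick sanity check, the sketch below recomputes it; `percent_by_level` is a hypothetical helper name, and the counts are copied from the metadata block.

```python
# Counts per relevance level, copied from "counts_by_value" in the metadata above.
counts_by_value = {0: 33781, 1: 2697, 2: 1834, 3: 1408, 4: 731}


def percent_by_level(counts):
    """Return {level: share of all judgments, in percent, rounded to one decimal}."""
    total = sum(counts.values())
    return {
        level: round(100 * n / total, 1)
        for level, n in sorted(counts.items())
    }


total_qrels = sum(counts_by_value.values())
print(total_qrels)  # matches the "count" field of "qrels"
print(percent_by_level(counts_by_value))
```

The per-level counts sum to the reported qrels count of 40451, and the rounded shares agree with the table.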

`"trec-cast/v1/2020/judged"`

The trec-cast/v1/2020 subset, restricted to queries that have at least one judgment in the qrels.

208 queries
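The metadata for trec-cast/v1/2020 lists 216 queries, so this subset drops the 8 that have no judgments. The sketch below shows one way such a filter could be written; it is an illustration of the rule, not the library's implementation. `judged_only` is a hypothetical helper, the two namedtuples are stand-ins carrying only the fields the filter needs, and the sample records are made up.

```python
from collections import namedtuple

# Stand-ins for illustration; the real records carry more fields.
Query = namedtuple("Query", ["query_id", "raw_utterance"])
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def judged_only(queries, qrels):
    """Keep only the queries whose query_id appears in at least one qrel."""
    judged_ids = {qrel.query_id for qrel in qrels}
    return [query for query in queries if query.query_id in judged_ids]


# Made-up records, not taken from the dataset.
sample_queries = [
    Query("q1", "first utterance"),
    Query("q2", "second utterance"),
    Query("q3", "third utterance"),
]
sample_qrels = [
    Qrel("q1", "doc_a", 2, "0"),
    Qrel("q3", "doc_b", 0, "0"),
]
print([q.query_id for q in judged_only(sample_queries, sample_qrels)])
```

A query counts as judged here even if all of its judgments have relevance 0, since the rule is about presence in the qrels rather than about having a relevant passage.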

Language: en

Query type:

Cast2020Query: (namedtuple)

query_id: str
raw_utterance: str
automatic_rewritten_utterance: str
manual_rewritten_utterance: str
manual_canonical_result_id: str
topic_number: int
turn_number: int

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2020/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, raw_utterance, automatic_rewritten_utterance, manual_rewritten_utterance, manual_canonical_result_id, topic_number, turn_number>

You can find more details about the Python API here.
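Because each query is a single turn, `topic_number` and `turn_number` can be used to reassemble whole conversations. The following is a minimal sketch under stated assumptions: `conversations` is a hypothetical helper, `Turn` is a stand-in that mirrors the Cast2020Query field names, and the sample utterances and IDs are invented. In practice the input would be `dataset.queries_iter()`.

```python
from collections import defaultdict, namedtuple

# Stand-in with the same field names as Cast2020Query (assumption for illustration).
Turn = namedtuple("Turn", [
    "query_id", "raw_utterance", "automatic_rewritten_utterance",
    "manual_rewritten_utterance", "manual_canonical_result_id",
    "topic_number", "turn_number",
])


def conversations(queries, field="raw_utterance"):
    """Return {topic_number: [utterance, ...]} with utterances in turn order."""
    by_topic = defaultdict(list)
    for query in queries:
        by_topic[query.topic_number].append(query)
    return {
        topic: [getattr(q, field) for q in sorted(turns, key=lambda q: q.turn_number)]
        for topic, turns in sorted(by_topic.items())
    }


# Invented turns, deliberately out of order.
sample_turns = [
    Turn("t1_2", "How long does it last?", "How long does it last? flu",
         "How long does the flu last?", "doc_b", 1, 2),
    Turn("t1_1", "What is the flu?", "What is the flu?",
         "What is the flu?", "doc_a", 1, 1),
    Turn("t2_1", "Tell me about jazz.", "Tell me about jazz.",
         "Tell me about jazz.", "doc_c", 2, 1),
]
print(conversations(sample_turns))
print(conversations(sample_turns, field="manual_rewritten_utterance"))
```

The `field` argument selects which of the three utterance variants to return, which makes it easy to compare raw turns against their rewritten forms.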

CLI

ir_datasets export trec-cast/v1/2020/judged queries



[query_id]    [raw_utterance]    [automatic_rewritten_utterance]    [manual_rewritten_utterance]    [manual_canonical_result_id]    [topic_number]    [turn_number]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2020/judged')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('raw_utterance'))

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.trec-cast.v1.2020.judged.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

39M docs

Inherits docs from trec-cast/v1

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2020/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export trec-cast/v1/2020/judged docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2020/judged')
# Index trec-cast/v1
indexer = pt.IterDictIndexer('./indices/trec-cast_v1', meta={"docno": 44})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-cast.v1.2020.judged')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

40K qrels

Inherits qrels from trec-cast/v1/2020

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Fails to meet. The passage is not relevant to the question. The passage is unrelated to the target query.	`34K`	83.5%
1	Slightly meets. The passage includes some information about the turn, but does not directly answer it. Users will find some useful information in the passage that may lead to the correct answer, perhaps after additional rounds of conversation (better than nothing).	`2.7K`	6.7%
2	Moderately meets. The passage answers the turn, but is focused on other information that is unrelated to the question. The passage may contain the answer, but users will need extra effort to pick the correct portion. The passage may be relevant, but it may only partially answer the turn, missing a small aspect of the context.	`1.8K`	4.5%
3	Highly meets. The passage answers the question and is focused on the turn. It would be a satisfactory answer if Google Assistant or Alexa returned this passage in response to the query. It may contain limited extraneous information.	`1.4K`	3.5%
4	Fully meets. The passage is a perfect answer for the turn. It includes all of the information needed to fully answer the turn in the conversation context. It focuses only on the subject and contains little extra information.	`731`	1.8%

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2020/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export trec-cast/v1/2020/judged qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2020/judged')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('raw_utterance'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.trec-cast.v1.2020.judged.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

\cite{Dalton2020Cast}

Bibtex:

@inproceedings{Dalton2020Cast, title={CAsT 2020: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2020} }