ir_datasets: TREC CAsT (Conversational Assistance)
To use version 0 of the corpus, you need a copy of the Washington Post Collection, provided by NIST.
Your organization may already have a copy. If this is the case, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" with NIST. It can take some time to process, but you will end up with a password-protected download link.
For the v0 corpus, the source file required is WashingtonPost.v2.tar.gz. ir_datasets expects the above file to be copied/linked under ~/.ir_datasets/wapo/WashingtonPost.v2.tar.gz.
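As a minimal sketch of the copy/link step described above, the following links a downloaded archive into the expected location. The function name `link_wapo_archive` and its `base_dir` parameter are hypothetical helpers introduced for illustration; they are not part of ir_datasets.

```python
from pathlib import Path


def link_wapo_archive(source, base_dir=None):
    """Symlink a downloaded WashingtonPost.v2.tar.gz into the expected location.

    `base_dir` defaults to ~/.ir_datasets; the parameter exists only so the
    sketch can be tried against a scratch directory (an assumption of this
    example, not an ir_datasets option).
    """
    source = Path(source).expanduser().resolve()
    if not source.is_file():
        raise FileNotFoundError(f"archive not found: {source}")
    base = Path(base_dir).expanduser() if base_dir is not None else Path.home() / ".ir_datasets"
    target = base / "wapo" / "WashingtonPost.v2.tar.gz"
    # Create ~/.ir_datasets/wapo/ if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        target.symlink_to(source)
    return target
```

Calling `link_wapo_archive("~/Downloads/WashingtonPost.v2.tar.gz")` would then create the link under `~/.ir_datasets/wapo/`; the download path shown here is only an example.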
The TREC Conversational Assistance Track (CAsT) is a benchmark for Conversational Information Seeking (CIS) models.
Version 0 of the TREC CAsT corpus. This version uses documents from the Washington Post (version 2), TREC CAR (version 2), and MS MARCO passage (version 1).
This corpus was originally meant to be used for evaluation of the 2019 task, but the Washington Post documents were not included for scoring in the final version because, in the organizers' words, "an error in the process led to ambiguous document ids," and Washington Post documents were removed from participating systems. As such, trec-cast/v1 (which doesn't include the Washington Post) should be used for the 2019 version of the task. However, this version can still be used for the training set (trec-cast/v0/train) or for replicating the original submissions to the track (prior to the removal of Washington Post documents).
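The removal of Washington Post documents from a ranking can be sketched as a simple filter. This example assumes that Washington Post doc_ids are recognizable by a `WAPO_` prefix; that prefix, the function name `drop_by_prefix`, and all identifiers in the toy ranking are assumptions made for illustration and should be checked against the actual corpus.

```python
def drop_by_prefix(ranked_doc_ids, prefix="WAPO_"):
    """Return the ranking without doc_ids that start with `prefix`, order preserved.

    The default prefix is an assumption about how Washington Post
    documents are identified; adjust it to match the corpus.
    """
    return [doc_id for doc_id in ranked_doc_ids if not doc_id.startswith(prefix)]


# Toy ranking with made-up identifiers.
ranking = ["MARCO_1", "WAPO_a", "CAR_b", "WAPO_c"]
filtered = drop_by_prefix(ranking)
```

Filtering after retrieval keeps the relative order of the remaining passages, which is one plausible way to approximate a run over the corpus without the Washington Post.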
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v0")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export trec-cast/v0 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0')
# Index trec-cast/v0
indexer = pt.IterDictIndexer('./indices/trec-cast_v0', meta={"docno": 46})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
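The value 46 passed as the "docno" meta length above matches the longest doc_id reported for this corpus. For a different collection, the required length could be derived as in the sketch below; `docno_meta_length` is a hypothetical helper, not a PyTerrier or ir_datasets function.

```python
def docno_meta_length(doc_ids):
    """Return the length of the longest doc_id in `doc_ids`.

    This is the smallest "docno" meta size that stores every identifier
    without truncation.
    """
    return max(len(doc_id) for doc_id in doc_ids)


# Toy identifiers, made up for illustration.
toy_ids = ["a", "abcd", "ab"]
needed = docno_meta_length(toy_ids)
```

Applied to a real corpus, e.g. `docno_meta_length(doc.doc_id for doc in dataset.docs_iter())`, this scans every document, so it is only worth running once.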
Bibtex:
@inproceedings{Dalton2019Cast, title={CAsT 2019: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2019} }
Metadata:
{ "docs": { "count": 47696605, "fields": { "doc_id": { "max_len": 46, "common_prefix": "" } } } }
Training set provided by TREC CAsT 2019.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, raw_utterance, topic_number, turn_number, topic_title, topic_description>
You can find more details about the Python API here.
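Because each query carries a topic_number and a turn_number, the turns of one conversation can be regrouped as sketched below. The `Query` namedtuple is a hypothetical stand-in with a subset of the fields listed above, the utterances are made up, and the sketch assumes that sorting by turn_number yields conversation order.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in with a subset of the query fields listed above.
Query = namedtuple("Query", ["query_id", "raw_utterance", "topic_number", "turn_number"])


def group_by_topic(queries):
    """Map each topic_number to its queries, ordered by turn_number."""
    conversations = defaultdict(list)
    for query in queries:
        conversations[query.topic_number].append(query)
    for turns in conversations.values():
        turns.sort(key=lambda q: q.turn_number)
    return dict(conversations)


# Toy data, made up for illustration.
toy_queries = [
    Query("1_2", "How is it treated?", 1, 2),
    Query("2_1", "Tell me about sharks.", 2, 1),
    Query("1_1", "What is throat cancer?", 1, 1),
]
conversations = group_by_topic(toy_queries)
```

With the real dataset, `group_by_topic(dataset.queries_iter())` would be the corresponding call.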
ir_datasets export trec-cast/v0/train queries
[query_id] [raw_utterance] [topic_number] [turn_number] [topic_title] [topic_description]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train')
index_ref = pt.IndexRef.of('./indices/trec-cast_v0') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('raw_utterance'))
You can find more details about PyTerrier retrieval here.
Inherits docs from trec-cast/v0
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export trec-cast/v0/train docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train')
# Index trec-cast/v0
indexer = pt.IterDictIndexer('./indices/trec-cast_v0', meta={"docno": 46})  # 46 = longest doc_id in this corpus
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 1.8K | 73.3% |
1 | relevant | 329 | 13.7% |
2 | very relevant | 311 | 13.0% |
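The percentages in the table follow directly from the per-level judgment counts reported for this dataset (1759, 329, and 311 out of 2399 qrels). A minimal sketch of that arithmetic, with `relevance_distribution` as a hypothetical helper name:

```python
def relevance_distribution(counts_by_value):
    """Return the percentage of judgments at each relevance level, rounded to 0.1."""
    total = sum(counts_by_value.values())
    return {
        level: round(100 * count / total, 1)
        for level, count in sorted(counts_by_value.items())
    }


# Per-level qrel counts for trec-cast/v0/train.
train_counts = {0: 1759, 1: 329, 2: 311}
distribution = relevance_distribution(train_counts)
print(distribution)  # {0: 73.3, 1: 13.7, 2: 13.0}
```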
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export trec-cast/v0/train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train')
index_ref = pt.IndexRef.of('./indices/trec-cast_v0') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('raw_utterance'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
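To make the nDCG@20 measure used in the experiment above more concrete, here is a minimal sketch of one common formulation (linear gain with a log2 rank discount). It is an illustration under that assumption; the implementation PyTerrier uses may differ in details such as the gain function or tie handling. The function names and toy judgments are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))


def ndcg_at_k(ranked_doc_ids, relevance, k=20):
    """nDCG@k for one query.

    `relevance` maps doc_id -> graded relevance; unjudged documents count as 0.
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy judgments using the 0/1/2 relevance scale of the training set.
toy_relevance = {"d1": 2, "d2": 0, "d3": 1}
score = ndcg_at_k(["d1", "d2", "d3"], toy_relevance)
```

Placing the "very relevant" document first and the "relevant" one second would give the ideal ordering and a score of 1.0; the ranking above scores slightly lower because the relevant document sits at rank 3.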
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export trec-cast/v0/train scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
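The scoreddocs records are flat (query_id, doc_id, score) tuples; this subset has 269,000 of them for 269 queries, i.e. 1,000 per query on average. A minimal sketch of turning them into per-query rankings follows. The `ScoredDoc` namedtuple mirrors the fields listed above, while `build_run`, its `depth` parameter, and the toy records are hypothetical.

```python
from collections import defaultdict, namedtuple

# Stand-in mirroring the scoreddoc fields listed above.
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])


def build_run(scoreddocs, depth=1000):
    """Group scored documents by query and rank them by descending score."""
    by_query = defaultdict(list)
    for scoreddoc in scoreddocs:
        by_query[scoreddoc.query_id].append(scoreddoc)
    run = {}
    for query_id, docs in by_query.items():
        docs.sort(key=lambda sd: sd.score, reverse=True)
        run[query_id] = [sd.doc_id for sd in docs[:depth]]
    return run


# Toy records, made up for illustration.
toy_scoreddocs = [
    ScoredDoc("q1", "d2", 1.5),
    ScoredDoc("q2", "d9", 0.3),
    ScoredDoc("q1", "d1", 3.0),
]
run = build_run(toy_scoreddocs)
```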
Bibtex:
@inproceedings{Dalton2019Cast, title={CAsT 2019: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2019} }
Metadata:
{ "docs": { "count": 47696605, "fields": { "doc_id": { "max_len": 46, "common_prefix": "" } } }, "queries": { "count": 269 }, "qrels": { "count": 2399, "fields": { "relevance": { "counts_by_value": { "2": 311, "0": 1759, "1": 329 } } } }, "scoreddocs": { "count": 269000 } }
trec-cast/v0/train, but with queries that do not appear in the qrels removed.
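The filtering that defines a judged subset can be sketched as follows: keep only the queries whose query_id occurs in at least one qrel. The namedtuples are stand-ins with the fields shown in the examples on this page; `judged_queries` and the toy records are hypothetical and do not reproduce how ir_datasets builds the subset internally.

```python
from collections import namedtuple

# Stand-ins for the record types shown in the examples on this page.
Query = namedtuple("Query", ["query_id", "text"])
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def judged_queries(queries, qrels):
    """Keep only the queries that have at least one relevance judgment."""
    judged_ids = {qrel.query_id for qrel in qrels}
    return [query for query in queries if query.query_id in judged_ids]


# Toy data, made up for illustration.
toy_queries = [Query("q1", "first"), Query("q2", "second"), Query("q3", "third")]
toy_qrels = [Qrel("q1", "d1", 2, "0"), Qrel("q3", "d7", 0, "0")]
kept = judged_queries(toy_queries, toy_qrels)
```

Note that a query judged only with relevance 0 is still kept, since it appears in the qrels.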
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export trec-cast/v0/train/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train/judged')
index_ref = pt.IndexRef.of('./indices/trec-cast_v0') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from trec-cast/v0
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export trec-cast/v0/train/judged docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train/judged')
# Index trec-cast/v0
indexer = pt.IterDictIndexer('./indices/trec-cast_v0', meta={"docno": 46})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Inherits qrels from trec-cast/v0/train
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 1.8K | 73.3% |
1 | relevant | 329 | 13.7% |
2 | very relevant | 311 | 13.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export trec-cast/v0/train/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train/judged')
index_ref = pt.IndexRef.of('./indices/trec-cast_v0') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v0/train/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export trec-cast/v0/train/judged scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v0/train/judged')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Bibtex:
@inproceedings{Dalton2019Cast, title={CAsT 2019: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2019} }
Metadata:
{ "docs": { "count": 47696605, "fields": { "doc_id": { "max_len": 46, "common_prefix": "" } } }, "queries": { "count": 120 }, "qrels": { "count": 2399, "fields": { "relevance": { "counts_by_value": { "2": 311, "0": 1759, "1": 329 } } } }, "scoreddocs": { "count": 120000 } }
Version 1 of the TREC CAsT corpus. This version uses documents from the TREC CAR (version 2) and MS MARCO passage (version 1). This version of the corpus was used for TREC CAsT 2019 and 2020.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export trec-cast/v1 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1')
# Index trec-cast/v1
indexer = pt.IterDictIndexer('./indices/trec-cast_v1', meta={"docno": 44})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Bibtex:
@inproceedings{Dalton2019Cast, title={CAsT 2019: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2019} }
Metadata:
{ "docs": { "count": 38622444, "fields": { "doc_id": { "max_len": 44, "common_prefix": "" } } } }
Official evaluation set for TREC CAsT 2019.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, raw_utterance, topic_number, turn_number, topic_title, topic_description>
You can find more details about the Python API here.
ir_datasets export trec-cast/v1/2019 queries
[query_id] [raw_utterance] [topic_number] [turn_number] [topic_title] [topic_description]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('raw_utterance'))
You can find more details about PyTerrier retrieval here.
Inherits docs from trec-cast/v1
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export trec-cast/v1/2019 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019')
# Index trec-cast/v1
indexer = pt.IterDictIndexer('./indices/trec-cast_v1', meta={"docno": 44})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Fails to meet. The passage is not relevant to the question. The passage is unrelated to the target query. | 21K | 72.3% |
1 | Slightly meets. The passage includes some information about the turn, but does not directly answer it. Users will find some useful information in the passage that may lead to the correct answer, perhaps after additional rounds of conversation (better than nothing). | 2.9K | 9.8% |
2 | Moderately meets. The passage answers the turn, but is focused on other information that is unrelated to the question. The passage may contain the answer, but users will need extra effort to pick the correct portion. The passage may be relevant, but it may only partially answer the turn, missing a small aspect of the context. | 2.2K | 7.3% |
3 | Highly meets. The passage answers the question and is focused on the turn. It would be a satisfactory answer if Google Assistant or Alexa returned this passage in response to the query. It may contain limited extraneous information. | 1.5K | 5.0% |
4 | Fully meets. The passage is a perfect answer for the turn. It includes all of the information needed to fully answer the turn in the conversation context. It focuses only on the subject and contains little extra information. | 1.6K | 5.5% |
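Metrics that expect binary relevance need a cut-off on this five-point scale. The sketch below counts how many judgments would be treated as relevant for a given cut-off, using the per-level counts reported for trec-cast/v1/2019. The helper name `count_at_or_above` and the `min_rel` parameter are hypothetical, and the choice of cut-off is left open here.

```python
def count_at_or_above(counts_by_value, min_rel):
    """Number of judgments whose relevance level is at least `min_rel`."""
    return sum(count for level, count in counts_by_value.items() if level >= min_rel)


# Per-level qrel counts for trec-cast/v1/2019.
counts_2019 = {0: 21230, 1: 2889, 2: 2157, 3: 1456, 4: 1618}

# How many judgments count as relevant under each possible cut-off.
relevant_by_cutoff = {min_rel: count_at_or_above(counts_2019, min_rel) for min_rel in (1, 2, 3, 4)}
```

A stricter cut-off shrinks the set of relevant passages quickly: of the 29,350 judgments, 8,120 are at level 1 or higher, but only 1,618 are at level 4.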
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export trec-cast/v1/2019 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('raw_utterance'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export trec-cast/v1/2019 scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Bibtex:
@inproceedings{Dalton2019Cast, title={CAsT 2019: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2019} }
Metadata:
{ "docs": { "count": 38622444, "fields": { "doc_id": { "max_len": 44, "common_prefix": "" } } }, "queries": { "count": 479 }, "qrels": { "count": 29350, "fields": { "relevance": { "counts_by_value": { "0": 21230, "1": 2889, "2": 2157, "3": 1456, "4": 1618 } } } }, "scoreddocs": { "count": 479000 } }
trec-cast/v1/2019, but with queries that do not appear in the qrels removed.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export trec-cast/v1/2019/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019/judged')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from trec-cast/v1
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export trec-cast/v1/2019/judged docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019/judged')
# Index trec-cast/v1
indexer = pt.IterDictIndexer('./indices/trec-cast_v1', meta={"docno": 44})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Inherits qrels from trec-cast/v1/2019
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Fails to meet. The passage is not relevant to the question. The passage is unrelated to the target query. | 21K | 72.3% |
1 | Slightly meets. The passage includes some information about the turn, but does not directly answer it. Users will find some useful information in the passage that may lead to the correct answer, perhaps after additional rounds of conversation (better than nothing). | 2.9K | 9.8% |
2 | Moderately meets. The passage answers the turn, but is focused on other information that is unrelated to the question. The passage may contain the answer, but users will need extra effort to pick the correct portion. The passage may be relevant, but it may only partially answer the turn, missing a small aspect of the context. | 2.2K | 7.3% |
3 | Highly meets. The passage answers the question and is focused on the turn. It would be a satisfactory answer if Google Assistant or Alexa returned this passage in response to the query. It may contain limited extraneous information. | 1.5K | 5.0% |
4 | Fully meets. The passage is a perfect answer for the turn. It includes all of the information needed to fully answer the turn in the conversation context. It focuses only on the subject and contains little extra information. | 1.6K | 5.5% |
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export trec-cast/v1/2019/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019/judged')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2019/judged")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export trec-cast/v1/2019/judged scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2019/judged')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
Bibtex:
@inproceedings{Dalton2019Cast, title={CAsT 2019: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2019} }
Metadata:
{ "docs": { "count": 38622444, "fields": { "doc_id": { "max_len": 44, "common_prefix": "" } } }, "queries": { "count": 173 }, "qrels": { "count": 29350, "fields": { "relevance": { "counts_by_value": { "0": 21230, "1": 2889, "2": 2157, "3": 1456, "4": 1618 } } } }, "scoreddocs": { "count": 173000 } }
Official evaluation set for TREC CAsT 2020.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, raw_utterance, automatic_rewritten_utterance, manual_rewritten_utterance, manual_canonical_result_id, topic_number, turn_number>
You can find more details about the Python API here.
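The 2020 queries carry three utterance variants (raw, automatically rewritten, and manually rewritten), so the query text sent to a retrieval system depends on which field is chosen. A minimal sketch of selecting one variant follows; `Query2020` is a hypothetical stand-in with a subset of the fields listed above, `topics_for` is a hypothetical helper, the utterances are made up, and the `qid`/`query` keys simply mirror the column names PyTerrier uses for topics.

```python
from collections import namedtuple

# Hypothetical stand-in with a subset of the 2020 query fields listed above.
Query2020 = namedtuple(
    "Query2020",
    ["query_id", "raw_utterance", "automatic_rewritten_utterance", "manual_rewritten_utterance"],
)


def topics_for(queries, field="raw_utterance"):
    """Build a list of {qid, query} records using the chosen utterance field."""
    return [{"qid": query.query_id, "query": getattr(query, field)} for query in queries]


# Toy query, made up for illustration.
toy_queries = [
    Query2020(
        "81_2",
        "How much does it cost?",
        "How much does a bike cost?",
        "How much does a road bike cost?",
    )
]
raw_topics = topics_for(toy_queries)
manual_topics = topics_for(toy_queries, field="manual_rewritten_utterance")
```

Running the same pipeline over each variant is a simple way to see how much the rewriting step contributes to retrieval quality.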
ir_datasets export trec-cast/v1/2020 queries
[query_id] [raw_utterance] [automatic_rewritten_utterance] [manual_rewritten_utterance] [manual_canonical_result_id] [topic_number] [turn_number]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2020')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('raw_utterance'))
You can find more details about PyTerrier retrieval here.
Inherits docs from trec-cast/v1
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2020")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export trec-cast/v1/2020 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2020')
# Index trec-cast/v1
indexer = pt.IterDictIndexer('./indices/trec-cast_v1', meta={"docno": 44})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Fails to meet. The passage is not relevant to the question. The passage is unrelated to the target query. | 34K | 83.5% |
1 | Slightly meets. The passage includes some information about the turn, but does not directly answer it. Users will find some useful information in the passage that may lead to the correct answer, perhaps after additional rounds of conversation (better than nothing). | 2.7K | 6.7% |
2 | Moderately meets. The passage answers the turn, but is focused on other information that is unrelated to the question. The passage may contain the answer, but users will need extra effort to pick the correct portion. The passage may be relevant, but it may only partially answer the turn, missing a small aspect of the context. | 1.8K | 4.5% |
3 | Highly meets. The passage answers the question and is focused on the turn. It would be a satisfactory answer if Google Assistant or Alexa returned this passage in response to the query. It may contain limited extraneous information. | 1.4K | 3.5% |
4 | Fully meets. The passage is a perfect answer for the turn. It includes all of the information needed to fully answer the turn in the conversation context. It focuses only on the subject and contains little extra information. | 731 | 1.8% |
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2020")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export trec-cast/v1/2020 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2020')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('raw_utterance'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Dalton2020Cast, title={CAsT 2020: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2020} }
Metadata:
{ "docs": { "count": 38622444, "fields": { "doc_id": { "max_len": 44, "common_prefix": "" } } }, "queries": { "count": 216 }, "qrels": { "count": 40451, "fields": { "relevance": { "counts_by_value": { "1": 2697, "0": 33781, "2": 1834, "3": 1408, "4": 731 } } } } }
trec-cast/v1/2020, but with queries that do not appear in the qrels removed.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2020/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export trec-cast/v1/2020/judged queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2020/judged')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from trec-cast/v1
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2020/judged")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export trec-cast/v1/2020/judged docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2020/judged')
# Index trec-cast/v1
indexer = pt.IterDictIndexer('./indices/trec-cast_v1', meta={"docno": 44})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Inherits qrels from trec-cast/v1/2020
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | Fails to meet. The passage is not relevant to the question. The passage is unrelated to the target query. | 34K | 83.5% |
1 | Slightly meets. The passage includes some information about the turn, but does not directly answer it. Users will find some useful information in the passage that may lead to the correct answer, perhaps after additional rounds of conversation (better than nothing). | 2.7K | 6.7% |
2 | Moderately meets. The passage answers the turn, but is focused on other information that is unrelated to the question. The passage may contain the answer, but users will need extra effort to pick the correct portion. The passage may be relevant, but it may only partially answer the turn, missing a small aspect of the context. | 1.8K | 4.5% |
3 | Highly meets. The passage answers the question and is focused on the turn. It would be a satisfactory answer if Google Assistant or Alexa returned this passage in response to the query. It may contain limited extraneous information. | 1.4K | 3.5% |
4 | Fully meets. The passage is a perfect answer for the turn. It includes all of the information needed to fully answer the turn in the conversation context. It focuses only on the subject and contains little extra information. | 731 | 1.8% |
Examples:
import ir_datasets
dataset = ir_datasets.load("trec-cast/v1/2020/judged")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export trec-cast/v1/2020/judged qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-cast/v1/2020/judged')
index_ref = pt.IndexRef.of('./indices/trec-cast_v1') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Dalton2020Cast, title={CAsT 2020: The Conversational Assistance Track Overview}, author={Jeffrey Dalton and Chenyan Xiong and Jamie Callan}, booktitle={TREC}, year={2020} }
Metadata:
{ "docs": { "count": 38622444, "fields": { "doc_id": { "max_len": 44, "common_prefix": "" } } }, "queries": { "count": 208 }, "qrels": { "count": 40451, "fields": { "relevance": { "counts_by_value": { "1": 2697, "0": 33781, "2": 1834, "3": 1408, "4": 731 } } } } }