ir_datasets: TREC CAR
An ad-hoc passage retrieval collection, constructed from Wikipedia and used as the basis of the TREC Complex Answer Retrieval (CAR) task.
Version 1.5 of the TREC CAR dataset. This version is used for year 1 (2017) of the TREC CAR shared task.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5')
# Index car/v1.5
indexer = pt.IterDictIndexer('./indices/car_v1.5', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Bibtex:
@article{Dietz2017Car, title={{TREC CAR}: A Data Set for Complex Answer Retrieval}, author={Laura Dietz and Ben Gamari}, year={2017}, note={Version 1.5}, url={http://trec-car.cs.unh.edu} }
Metadata:
{ "docs": { "count": 29678367, "fields": { "doc_id": { "max_len": 40, "common_prefix": "" } } } }
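The `meta={"docno": 40}` argument in the indexing example reserves 40 characters per document ID, which appears to match the `max_len` reported for `doc_id` in the dataset metadata. A minimal sketch of deriving that value instead of hard-coding it, assuming the metadata is at hand as a JSON string (the variable names are illustrative and not part of ir_datasets or PyTerrier):

```python
import json

# Metadata copied from the car/v1.5 entry above.
metadata_json = (
    '{ "docs": { "count": 29678367, "fields": '
    '{ "doc_id": { "max_len": 40, "common_prefix": "" } } } }'
)

metadata = json.loads(metadata_json)

# Longest document ID in the corpus, in characters.
docno_len = metadata["docs"]["fields"]["doc_id"]["max_len"]

# Dictionary in the shape passed as `meta=` in the indexing example.
meta = {"docno": docno_len}
print(meta)
```

Sizing the field from the metadata avoids truncated IDs if a different corpus with longer identifiers is indexed with the same script.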
Unofficial test set consisting of manually selected articles. Sometimes used as a validation set.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/test200")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, title, headings>
You can find more details about the Python API here.
ir_datasets export car/v1.5/test200 queries
[query_id] [text] [title] [headings]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/test200')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Inherits docs from car/v1.5
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/test200")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export car/v1.5/test200 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/test200')
# Index car/v1.5
indexer = pt.IterDictIndexer('./indices/car_v1.5', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Paragraph appears under heading | 4.7K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/test200")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export car/v1.5/test200 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
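The exported qrels can be read back with a few lines of Python. The sketch below assumes the export is tab-separated with the four columns in the order shown above and no header row; the sample rows and the `read_qrels` helper are hypothetical and only illustrate the shape of the data.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample of exported lines; real IDs will differ.
sample_tsv = (
    "q1\tdocA\t1\t0\n"
    "q1\tdocB\t1\t0\n"
    "q2\tdocC\t1\t0\n"
)

def read_qrels(handle):
    """Map query_id -> {doc_id: relevance} from a tab-separated qrels export."""
    qrels = defaultdict(dict)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for query_id, doc_id, relevance, _iteration in reader:
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)

qrels = read_qrels(io.StringIO(sample_tsv))
print(qrels)
```

To read a real export, pass an open file handle in place of the `io.StringIO` wrapper.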
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/test200')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Nanni2017BenchmarkCar, title={Benchmark for complex answer retrieval}, author={Nanni, Federico and Mitra, Bhaskar and Magnusson, Matt and Dietz, Laura}, booktitle={ICTIR}, year={2017} }
@article{Dietz2017Car, title={{TREC CAR}: A Data Set for Complex Answer Retrieval}, author={Laura Dietz and Ben Gamari}, year={2017}, note={Version 1.5}, url={http://trec-car.cs.unh.edu} }
Metadata:
{ "docs": { "count": 29678367, "fields": { "doc_id": { "max_len": 40, "common_prefix": "" } } }, "queries": { "count": 1987 }, "qrels": { "count": 4706, "fields": { "relevance": { "counts_by_value": { "1": 4706 } } } } }
Fold 0 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).
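The idea of assuming relevance from page structure can be sketched in a few lines. The example below is purely illustrative: the `page` dictionary, the way the query string is formed from title and heading, and the `auto_qrels` helper are assumptions made for this sketch, not the actual TREC CAR data format or tooling.

```python
# Hypothetical page outline: each heading maps to the paragraph IDs beneath it.
page = {
    "title": "Example page",
    "sections": {
        "History": ["p1", "p2"],
        "Applications": ["p3"],
    },
}

def auto_qrels(page):
    """Return (query, paragraph_id, relevance) triples with relevance fixed at 1."""
    triples = []
    for heading, paragraph_ids in page["sections"].items():
        # Illustrative query text: page title followed by the heading.
        query = f"{page['title']} {heading}"
        for paragraph_id in paragraph_ids:
            triples.append((query, paragraph_id, 1))
    return triples

for triple in auto_qrels(page):
    print(triple)
```

Every judgment produced this way has relevance 1, which is consistent with the single relevance level listed for the automatically judged subsets.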
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold0")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, title, headings>
You can find more details about the Python API here.
ir_datasets export car/v1.5/train/fold0 queries
[query_id] [text] [title] [headings]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold0')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Inherits docs from car/v1.5
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold0")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export car/v1.5/train/fold0 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold0')
# Index car/v1.5
indexer = pt.IterDictIndexer('./indices/car_v1.5', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Paragraph appears under heading | 1.1M | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold0")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export car/v1.5/train/fold0 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold0')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Dietz2017TrecCar, title={TREC Complex Answer Retrieval Overview.}, author={Dietz, Laura and Verma, Manisha and Radlinski, Filip and Craswell, Nick}, booktitle={TREC}, year={2017} }
@article{Dietz2017Car, title={{TREC CAR}: A Data Set for Complex Answer Retrieval}, author={Laura Dietz and Ben Gamari}, year={2017}, note={Version 1.5}, url={http://trec-car.cs.unh.edu} }
Metadata:
{ "docs": { "count": 29678367, "fields": { "doc_id": { "max_len": 40, "common_prefix": "" } } }, "queries": { "count": 467946 }, "qrels": { "count": 1054369, "fields": { "relevance": { "counts_by_value": { "1": 1054369 } } } } }
Fold 1 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, title, headings>
You can find more details about the Python API here.
ir_datasets export car/v1.5/train/fold1 queries
[query_id] [text] [title] [headings]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold1')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Inherits docs from car/v1.5
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export car/v1.5/train/fold1 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold1')
# Index car/v1.5
indexer = pt.IterDictIndexer('./indices/car_v1.5', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Paragraph appears under heading | 1.1M | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold1")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export car/v1.5/train/fold1 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold1')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Dietz2017TrecCar, title={TREC Complex Answer Retrieval Overview.}, author={Dietz, Laura and Verma, Manisha and Radlinski, Filip and Craswell, Nick}, booktitle={TREC}, year={2017} }
@article{Dietz2017Car, title={{TREC CAR}: A Data Set for Complex Answer Retrieval}, author={Laura Dietz and Ben Gamari}, year={2017}, note={Version 1.5}, url={http://trec-car.cs.unh.edu} }
Metadata:
{ "docs": { "count": 29678367, "fields": { "doc_id": { "max_len": 40, "common_prefix": "" } } }, "queries": { "count": 466596 }, "qrels": { "count": 1052398, "fields": { "relevance": { "counts_by_value": { "1": 1052398 } } } } }
Fold 2 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, title, headings>
You can find more details about the Python API here.
ir_datasets export car/v1.5/train/fold2 queries
[query_id] [text] [title] [headings]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold2')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Inherits docs from car/v1.5
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export car/v1.5/train/fold2 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold2')
# Index car/v1.5
indexer = pt.IterDictIndexer('./indices/car_v1.5', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Paragraph appears under heading | 1.1M | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export car/v1.5/train/fold2 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold2')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Dietz2017TrecCar, title={TREC Complex Answer Retrieval Overview.}, author={Dietz, Laura and Verma, Manisha and Radlinski, Filip and Craswell, Nick}, booktitle={TREC}, year={2017} }
@article{Dietz2017Car, title={{TREC CAR}: A Data Set for Complex Answer Retrieval}, author={Laura Dietz and Ben Gamari}, year={2017}, note={Version 1.5}, url={http://trec-car.cs.unh.edu} }
Metadata:
{ "docs": { "count": 29678367, "fields": { "doc_id": { "max_len": 40, "common_prefix": "" } } }, "queries": { "count": 469323 }, "qrels": { "count": 1061162, "fields": { "relevance": { "counts_by_value": { "1": 1061162 } } } } }
Fold 3 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, title, headings>
You can find more details about the Python API here.
ir_datasets export car/v1.5/train/fold3 queries
[query_id] [text] [title] [headings]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold3')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Inherits docs from car/v1.5
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold3")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export car/v1.5/train/fold3 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold3')
# Index car/v1.5
indexer = pt.IterDictIndexer('./indices/car_v1.5', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Paragraph appears under heading | 1.0M | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold3")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export car/v1.5/train/fold3 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold3')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Dietz2017TrecCar, title={TREC Complex Answer Retrieval Overview.}, author={Dietz, Laura and Verma, Manisha and Radlinski, Filip and Craswell, Nick}, booktitle={TREC}, year={2017} }
@article{Dietz2017Car, title={{TREC CAR}: A Data Set for Complex Answer Retrieval}, author={Laura Dietz and Ben Gamari}, year={2017}, note={Version 1.5}, url={http://trec-car.cs.unh.edu} }
Metadata:
{ "docs": { "count": 29678367, "fields": { "doc_id": { "max_len": 40, "common_prefix": "" } } }, "queries": { "count": 463314 }, "qrels": { "count": 1046784, "fields": { "relevance": { "counts_by_value": { "1": 1046784 } } } } }
Fold 4 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, title, headings>
You can find more details about the Python API here.
ir_datasets export car/v1.5/train/fold4 queries
[query_id] [text] [title] [headings]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold4')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Inherits docs from car/v1.5
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold4")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export car/v1.5/train/fold4 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold4')
# Index car/v1.5
indexer = pt.IterDictIndexer('./indices/car_v1.5', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Paragraph appears under heading | 1.1M | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold4")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export car/v1.5/train/fold4 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold4')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Dietz2017TrecCar, title={TREC Complex Answer Retrieval Overview.}, author={Dietz, Laura and Verma, Manisha and Radlinski, Filip and Craswell, Nick}, booktitle={TREC}, year={2017} }
@article{Dietz2017Car, title={{TREC CAR}: A Data Set for Complex Answer Retrieval}, author={Laura Dietz and Ben Gamari}, year={2017}, note={Version 1.5}, url={http://trec-car.cs.unh.edu} }
Metadata:
{ "docs": { "count": 29678367, "fields": { "doc_id": { "max_len": 40, "common_prefix": "" } } }, "queries": { "count": 468789 }, "qrels": { "count": 1061911, "fields": { "relevance": { "counts_by_value": { "1": 1061911 } } } } }
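With five training folds available, one plausible use is cross-validation: train on four folds and hold out the fifth. The sketch below only builds the dataset identifiers for each split. Using the folds this way is an assumption of this example, and `cross_validation_splits` is a hypothetical helper, not an ir_datasets function.

```python
# Dataset IDs for the five training folds listed above.
FOLDS = [f"car/v1.5/train/fold{i}" for i in range(5)]

def cross_validation_splits(folds):
    """Yield (training_folds, held_out_fold) pairs, holding out each fold once."""
    for held_out in folds:
        training = [fold for fold in folds if fold != held_out]
        yield training, held_out

splits = list(cross_validation_splits(FOLDS))
for training, held_out in splits:
    print(held_out, "<-", training)
```

Each identifier can then be passed to `ir_datasets.load` as in the examples above.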
Official test set of TREC CAR 2017 (year 1).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/trec-y1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, title, headings>
You can find more details about the Python API here.
ir_datasets export car/v1.5/trec-y1 queries
[query_id] [text] [title] [headings]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/trec-y1')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Inherits docs from car/v1.5
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/trec-y1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export car/v1.5/trec-y1 docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/trec-y1')
# Index car/v1.5
indexer = pt.IterDictIndexer('./indices/car_v1.5', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Bibtex:
@inproceedings{Dietz2017TrecCar, title={TREC Complex Answer Retrieval Overview.}, author={Dietz, Laura and Verma, Manisha and Radlinski, Filip and Craswell, Nick}, booktitle={TREC}, year={2017} }
@article{Dietz2017Car, title={{TREC CAR}: A Data Set for Complex Answer Retrieval}, author={Laura Dietz and Ben Gamari}, year={2017}, note={Version 1.5}, url={http://trec-car.cs.unh.edu} }
Metadata:
{ "docs": { "count": 29678367, "fields": { "doc_id": { "max_len": 40, "common_prefix": "" } } }, "queries": { "count": 2287 } }
Official test set of TREC CAR 2017 (year 1), using automatic relevance judgments (assumed from the hierarchical structure of pages, i.e., paragraphs under a header are assumed relevant).
Inherits queries from car/v1.5/trec-y1
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/trec-y1/auto")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, title, headings>
You can find more details about the Python API here.
ir_datasets export car/v1.5/trec-y1/auto queries
[query_id] [text] [title] [headings]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/trec-y1/auto')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Inherits docs from car/v1.5
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/trec-y1/auto")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export car/v1.5/trec-y1/auto docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/trec-y1/auto')
# Index car/v1.5
indexer = pt.IterDictIndexer('./indices/car_v1.5', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | Paragraph appears under heading | 5.8K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/trec-y1/auto")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export car/v1.5/trec-y1/auto qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/trec-y1/auto')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Dietz2017TrecCar, title={TREC Complex Answer Retrieval Overview.}, author={Dietz, Laura and Verma, Manisha and Radlinski, Filip and Craswell, Nick}, booktitle={TREC}, year={2017} }
@article{Dietz2017Car, title={{TREC CAR}: A Data Set for Complex Answer Retrieval}, author={Laura Dietz and Ben Gamari}, year={2017}, note={Version 1.5}, url={http://trec-car.cs.unh.edu} }
Metadata:
{ "docs": { "count": 29678367, "fields": { "doc_id": { "max_len": 40, "common_prefix": "" } } }, "queries": { "count": 2287 }, "qrels": { "count": 5820, "fields": { "relevance": { "counts_by_value": { "1": 5820 } } } } }
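The metadata counts give a rough sense of how sparse the automatic judgments are. A small sketch, using only the query and qrel counts reported for car/v1.5/trec-y1/auto; note that the average is taken over all queries, including any that may have no judgments at all:

```python
# Counts copied from the car/v1.5/trec-y1/auto metadata above.
query_count = 2287
qrel_count = 5820

# Mean number of judged (relevant) paragraphs per query.
qrels_per_query = qrel_count / query_count
print(f"{qrels_per_query:.2f} judged paragraphs per query on average")
```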
Official test set of TREC CAR 2017 (year 1), using manual graded relevance judgments.
Inherits queries from car/v1.5/trec-y1
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/trec-y1/manual")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, title, headings>
You can find more details about the Python API here.
ir_datasets export car/v1.5/trec-y1/manual queries
[query_id] [text] [title] [headings]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/trec-y1/manual')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Inherits docs from car/v1.5
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/trec-y1/manual")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export car/v1.5/trec-y1/manual docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/trec-y1/manual')
# Index car/v1.5
indexer = pt.IterDictIndexer('./indices/car_v1.5', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
-2 | Trash | 42 | 0.1% |
-1 | NO, non-relevant | 13K | 43.2% |
0 | Non-relevant, but roughly on TOPIC | 9.2K | 31.2% |
1 | CAN be mentioned | 3.1K | 10.5% |
2 | SHOULD be mentioned | 2.0K | 6.7% |
3 | MUST be mentioned | 2.5K | 8.3% |
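The Count column is abbreviated, so the percentages are easier to check against the exact counts. The sketch below recomputes them from the `counts_by_value` figures in this subset's metadata; the variable names are illustrative.

```python
# Exact judgment counts per relevance level, from the subset's metadata.
counts_by_value = {-2: 42, -1: 12785, 0: 9219, 1: 3094, 2: 1970, 3: 2461}

total = sum(counts_by_value.values())

# Share of each relevance level, as a percentage rounded to one decimal place.
percentages = {
    rel: round(100 * count / total, 1)
    for rel, count in sorted(counts_by_value.items())
}

for rel, pct in percentages.items():
    print(f"{rel:>2}: {pct}%")
```

Roughly three quarters of the manual judgments fall at level 0 or below, so the choice of relevance cutoff has a large effect on binary measures.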
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v1.5/trec-y1/manual")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export car/v1.5/trec-y1/manual qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/trec-y1/manual')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
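Because the manual judgments are graded from -2 to 3, binary measures such as MAP need a cutoff for what counts as relevant. The sketch below shows one way to binarize graded judgments in plain Python. Treating level 1 ("CAN be mentioned") as the minimum relevant grade is an assumption of this example, and the `binarize` helper and sample judgments are hypothetical.

```python
def binarize(qrels, min_rel=1):
    """Map graded judgments to 0/1, counting grades >= min_rel as relevant."""
    return {
        query_id: {doc_id: int(grade >= min_rel) for doc_id, grade in docs.items()}
        for query_id, docs in qrels.items()
    }

# Hypothetical graded judgments covering several of the levels listed above.
graded = {"q1": {"docA": 3, "docB": 0, "docC": -1, "docD": 1}}

lenient = binarize(graded)            # levels 1-3 count as relevant
strict = binarize(graded, min_rel=2)  # only SHOULD/MUST count as relevant
print(lenient)
print(strict)
```

Reporting results under more than one cutoff makes it clear how sensitive a system's score is to the definition of relevance.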
Bibtex:
@inproceedings{Dietz2017TrecCar, title={TREC Complex Answer Retrieval Overview.}, author={Dietz, Laura and Verma, Manisha and Radlinski, Filip and Craswell, Nick}, booktitle={TREC}, year={2017} }
@article{Dietz2017Car, title={{TREC CAR}: A Data Set for Complex Answer Retrieval}, author={Laura Dietz and Ben Gamari}, year={2017}, note={Version 1.5}, url={http://trec-car.cs.unh.edu} }
Metadata:
{ "docs": { "count": 29678367, "fields": { "doc_id": { "max_len": 40, "common_prefix": "" } } }, "queries": { "count": 2287 }, "qrels": { "count": 29571, "fields": { "relevance": { "counts_by_value": { "-1": 12785, "0": 9219, "1": 3094, "2": 1970, "3": 2461, "-2": 42 } } } } }
Version 2.0 of the TREC CAR dataset.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("car/v2.0")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v2.0')
# Index car/v2.0
indexer = pt.IterDictIndexer('./indices/car_v2.0', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
Bibtex:
@article{Dietz2017Car, title={{TREC CAR}: A Data Set for Complex Answer Retrieval}, author={Laura Dietz and Ben Gamari}, year={2017}, note={Version 1.5}, url={http://trec-car.cs.unh.edu} }
Metadata:
{ "docs": { "count": 29794697, "fields": { "doc_id": { "max_len": 40, "common_prefix": "" } } } }