ir_datasets : TREC CAR

import ir_datasets
dataset = ir_datasets.load("car/v1.5")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5 docs



[doc_id]    [text]
...
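The export prints one record per line with tab-separated fields. A minimal sketch of reading such output back in Python follows; the helper name `parse_docs_tsv`, the `Doc` stand-in, and the sample lines are all made up for illustration, and the sketch assumes the text itself contains no newlines.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Doc(NamedTuple):
    """Stand-in for the (doc_id, text) records shown above."""
    doc_id: str
    text: str


def parse_docs_tsv(stream: TextIO) -> Iterator[Doc]:
    """Yield one Doc per non-empty line of a tab-separated export."""
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first tab only, so tabs inside the text are preserved.
        doc_id, text = line.split("\t", 1)
        yield Doc(doc_id, text)


# Hypothetical sample; real input would come from `ir_datasets export car/v1.5 docs`.
sample = io.StringIO("p1\tFirst paragraph.\np2\tSecond paragraph.\n")
docs = list(parse_docs_tsv(sample))
```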

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5')
# Index car/v1.5
indexer = pt.IterDictIndexer('./indices/car_v1.5', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.car.v1.5')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

\cite{Dietz2017Car}

Bibtex:

@article{Dietz2017Car,
  title={{TREC CAR}: A Data Set for Complex Answer Retrieval},
  author={Laura Dietz and Ben Gamari},
  year={2017},
  note={Version 1.5},
  url={http://trec-car.cs.unh.edu}
}

{
  "docs": {
    "count": 29678367,
    "fields": {
      "doc_id": {
        "max_len": 40,
        "common_prefix": ""
      }
    }
  }
}
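The `max_len` of 40 reported for `doc_id` is the same value used in the `meta={"docno": 40}` setting of the PyTerrier indexing example. A small sketch of checking IDs against that limit before indexing might look like this; the helper name and the sample IDs are hypothetical.

```python
from typing import Iterable, List

DOCNO_MAX_LEN = 40  # taken from the "max_len" value in the metadata above


def oversized_ids(doc_ids: Iterable[str], limit: int = DOCNO_MAX_LEN) -> List[str]:
    """Return the IDs that would not fit in a metadata field of `limit` characters."""
    return [doc_id for doc_id in doc_ids if len(doc_id) > limit]


# Hypothetical IDs: one exactly 40 characters long, one 41 characters long.
sample_ids = ["a" * 40, "b" * 41]
too_long = oversized_ids(sample_ids)
```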

`"car/v1.5/test200"`

Unofficial test set consisting of manually selected articles. Sometimes used as a validation set.

2.0K queries

Language: en

Query type:

CarQuery: (namedtuple)

query_id: str
text: str
title: str
headings: Tuple[str, ...]
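To make the field layout concrete, here is a local stand-in for the `CarQuery` tuple built with `typing.NamedTuple`. The sample values are invented, and the way `text` is derived here (the title followed by the headings, joined with spaces) is an assumption for illustration rather than something this page states.

```python
from typing import NamedTuple, Tuple


class CarQuery(NamedTuple):
    """Local stand-in mirroring the fields listed above."""
    query_id: str
    text: str
    title: str
    headings: Tuple[str, ...]


def make_query(query_id: str, title: str, headings: Tuple[str, ...]) -> CarQuery:
    # Assumption: text is the title followed by each heading, space-separated.
    text = " ".join((title,) + headings)
    return CarQuery(query_id, text, title, headings)


# Invented example values.
query = make_query("page/section", "Green tea", ("Health effects", "Cancer"))
```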

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/test200")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, title, headings>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/test200 queries



[query_id]    [text]    [title]    [headings]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/test200')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.car.v1.5.test200.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

30M docs

Inherits docs from car/v1.5

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/test200")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/test200 docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/test200')
# Index car/v1.5
indexer = pt.IterDictIndexer('./indices/car_v1.5', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.car.v1.5.test200')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

4.7K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
1	Paragraph appears under heading	`4.7K`	100.0%
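Because every judgment carries relevance 1, the qrels amount to a set of relevant paragraphs per query. A minimal sketch of grouping them by `query_id`, using a local stand-in for `TrecQrel` and invented records:

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class TrecQrel(NamedTuple):
    """Local stand-in mirroring the fields listed above."""
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def relevant_docs(qrels: Iterable[TrecQrel]) -> Dict[str, Set[str]]:
    """Map each query_id to the set of doc_ids judged relevant for it."""
    by_query: Dict[str, Set[str]] = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance > 0:
            by_query[qrel.query_id].add(qrel.doc_id)
    return dict(by_query)


# Invented records for illustration.
sample = [
    TrecQrel("q1", "d1", 1, "0"),
    TrecQrel("q1", "d2", 1, "0"),
    TrecQrel("q2", "d3", 1, "0"),
]
grouped = relevant_docs(sample)
```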

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/test200")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/test200 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/test200')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.car.v1.5.test200.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

\cite{Nanni2017BenchmarkCar,Dietz2017Car}

Bibtex:

@inproceedings{Nanni2017BenchmarkCar,
  title={Benchmark for complex answer retrieval},
  author={Nanni, Federico and Mitra, Bhaskar and Magnusson, Matt and Dietz, Laura},
  booktitle={ICTIR},
  year={2017}
}

@article{Dietz2017Car,
  title={{TREC CAR}: A Data Set for Complex Answer Retrieval},
  author={Laura Dietz and Ben Gamari},
  year={2017},
  note={Version 1.5},
  url={http://trec-car.cs.unh.edu}
}

{
  "docs": {
    "count": 29678367,
    "fields": {
      "doc_id": {
        "max_len": 40,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1987
  },
  "qrels": {
    "count": 4706,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 4706
        }
      }
    }
  }
}
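The metadata is plain JSON, so summary figures can be derived from it directly. A short sketch computing the average number of judged paragraphs per query from the counts above (the variable names are arbitrary, and the `doc_id` field details are omitted for brevity):

```python
import json

# Counts copied from the metadata block above.
metadata = json.loads("""
{
  "docs": {"count": 29678367},
  "queries": {"count": 1987},
  "qrels": {"count": 4706, "fields": {"relevance": {"counts_by_value": {"1": 4706}}}}
}
""")

n_queries = metadata["queries"]["count"]
n_qrels = metadata["qrels"]["count"]

# Average number of qrels per query, taken over all queries in the subset.
qrels_per_query = n_qrels / n_queries

# Check that the per-level relevance counts add up to the total qrel count.
relevance_counts = metadata["qrels"]["fields"]["relevance"]["counts_by_value"]
all_accounted = sum(relevance_counts.values()) == n_qrels
```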

`"car/v1.5/train/fold0"`

Fold 0 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).
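The relevance assumption can be pictured as follows: each heading of a page acts as a query, and every paragraph filed under it is marked relevant. This is only an illustrative sketch; the page layout, the `page_id/heading_id` query ID scheme, and all IDs below are hypothetical, not the procedure used to build the official files.

```python
from typing import Dict, List, NamedTuple


class Qrel(NamedTuple):
    query_id: str
    doc_id: str
    relevance: int


def qrels_from_page(page_id: str, sections: Dict[str, List[str]]) -> List[Qrel]:
    """Mark every paragraph under a heading as relevant (1) to that heading's query."""
    qrels = []
    for heading_id, paragraph_ids in sections.items():
        query_id = f"{page_id}/{heading_id}"  # hypothetical ID scheme
        for paragraph_id in paragraph_ids:
            qrels.append(Qrel(query_id, paragraph_id, 1))
    return qrels


# Invented page with two headings and three paragraphs.
page_sections = {"History": ["p1", "p2"], "Uses": ["p3"]}
auto_qrels = qrels_from_page("Example_page", page_sections)
```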

468K queries

Language: en

Query type:

CarQuery: (namedtuple)

query_id: str
text: str
title: str
headings: Tuple[str, ...]

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold0")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, title, headings>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/train/fold0 queries



[query_id]    [text]    [title]    [headings]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold0')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.car.v1.5.train.fold0.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

30M docs

Inherits docs from car/v1.5

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold0")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/train/fold0 docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold0')
# Index car/v1.5
indexer = pt.IterDictIndexer('./indices/car_v1.5', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.car.v1.5.train.fold0')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

1.1M qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
1	Paragraph appears under heading	`1.1M`	100.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold0")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/train/fold0 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold0')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.car.v1.5.train.fold0.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

\cite{Dietz2017TrecCar,Dietz2017Car}

Bibtex:

@inproceedings{Dietz2017TrecCar,
  title={TREC Complex Answer Retrieval Overview.},
  author={Dietz, Laura and Verma, Manisha and Radlinski, Filip and Craswell, Nick},
  booktitle={TREC},
  year={2017}
}

@article{Dietz2017Car,
  title={{TREC CAR}: A Data Set for Complex Answer Retrieval},
  author={Laura Dietz and Ben Gamari},
  year={2017},
  note={Version 1.5},
  url={http://trec-car.cs.unh.edu}
}

{
  "docs": {
    "count": 29678367,
    "fields": {
      "doc_id": {
        "max_len": 40,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 467946
  },
  "qrels": {
    "count": 1054369,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 1054369
        }
      }
    }
  }
}

`"car/v1.5/train/fold1"`

Fold 1 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).

467K queries

Language: en

Query type:

CarQuery: (namedtuple)

query_id: str
text: str
title: str
headings: Tuple[str, ...]

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, title, headings>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/train/fold1 queries



[query_id]    [text]    [title]    [headings]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold1')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.car.v1.5.train.fold1.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

30M docs

Inherits docs from car/v1.5

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/train/fold1 docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold1')
# Index car/v1.5
indexer = pt.IterDictIndexer('./indices/car_v1.5', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.car.v1.5.train.fold1')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

1.1M qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
1	Paragraph appears under heading	`1.1M`	100.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold1")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/train/fold1 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold1')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.car.v1.5.train.fold1.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

\cite{Dietz2017TrecCar,Dietz2017Car}

Bibtex:

{
  "docs": {
    "count": 29678367,
    "fields": {
      "doc_id": {
        "max_len": 40,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 466596
  },
  "qrels": {
    "count": 1052398,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 1052398
        }
      }
    }
  }
}

`"car/v1.5/train/fold2"`

Fold 2 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).

469K queries

Language: en

Query type:

CarQuery: (namedtuple)

query_id: str
text: str
title: str
headings: Tuple[str, ...]

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, title, headings>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/train/fold2 queries



[query_id]    [text]    [title]    [headings]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold2')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.car.v1.5.train.fold2.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

30M docs

Inherits docs from car/v1.5

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/train/fold2 docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold2')
# Index car/v1.5
indexer = pt.IterDictIndexer('./indices/car_v1.5', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.car.v1.5.train.fold2')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

1.1M qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
1	Paragraph appears under heading	`1.1M`	100.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold2")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/train/fold2 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold2')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.car.v1.5.train.fold2.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

\cite{Dietz2017TrecCar,Dietz2017Car}

Bibtex:

{
  "docs": {
    "count": 29678367,
    "fields": {
      "doc_id": {
        "max_len": 40,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 469323
  },
  "qrels": {
    "count": 1061162,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 1061162
        }
      }
    }
  }
}

`"car/v1.5/train/fold3"`

Fold 3 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).

463K queries

Language: en

Query type:

CarQuery: (namedtuple)

query_id: str
text: str
title: str
headings: Tuple[str, ...]

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, title, headings>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/train/fold3 queries



[query_id]    [text]    [title]    [headings]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold3')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.car.v1.5.train.fold3.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

30M docs

Inherits docs from car/v1.5

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold3")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/train/fold3 docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold3')
# Index car/v1.5
indexer = pt.IterDictIndexer('./indices/car_v1.5', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.car.v1.5.train.fold3')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

1.0M qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
1	Paragraph appears under heading	`1.0M`	100.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold3")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/train/fold3 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold3')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.car.v1.5.train.fold3.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

\cite{Dietz2017TrecCar,Dietz2017Car}

Bibtex:

{
  "docs": {
    "count": 29678367,
    "fields": {
      "doc_id": {
        "max_len": 40,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 463314
  },
  "qrels": {
    "count": 1046784,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 1046784
        }
      }
    }
  }
}

`"car/v1.5/train/fold4"`

Fold 4 of the official large training set for TREC CAR 2017. Relevance is assumed from the hierarchical structure of pages (i.e., paragraphs under a header are assumed relevant).

469K queries

Language: en

Query type:

CarQuery: (namedtuple)

query_id: str
text: str
title: str
headings: Tuple[str, ...]

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, title, headings>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/train/fold4 queries



[query_id]    [text]    [title]    [headings]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold4')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.car.v1.5.train.fold4.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

30M docs

Inherits docs from car/v1.5

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold4")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/train/fold4 docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold4')
# Index car/v1.5
indexer = pt.IterDictIndexer('./indices/car_v1.5', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.car.v1.5.train.fold4')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

1.1M qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
1	Paragraph appears under heading	`1.1M`	100.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/train/fold4")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/train/fold4 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/train/fold4')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.car.v1.5.train.fold4.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

\cite{Dietz2017TrecCar,Dietz2017Car}

Bibtex:

{
  "docs": {
    "count": 29678367,
    "fields": {
      "doc_id": {
        "max_len": 40,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 468789
  },
  "qrels": {
    "count": 1061911,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 1061911
        }
      }
    }
  }
}
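With five training folds available, one common arrangement is to hold a single fold out for validation and train on the remaining four. The sketch below only builds the lists of dataset IDs for each split; the rotation scheme and the helper name are illustrative choices, not something prescribed by TREC CAR or ir_datasets.

```python
from typing import Iterator, List, Tuple

FOLD_IDS = [f"car/v1.5/train/fold{i}" for i in range(5)]

# Query counts for fold0..fold4, copied from the metadata blocks on this page.
FOLD_QUERY_COUNTS = [467946, 466596, 469323, 463314, 468789]


def cross_validation_splits(fold_ids: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training fold IDs, held-out fold ID) for each fold in turn."""
    for held_out in fold_ids:
        train = [fold_id for fold_id in fold_ids if fold_id != held_out]
        yield train, held_out


splits = list(cross_validation_splits(FOLD_IDS))
total_queries = sum(FOLD_QUERY_COUNTS)
```

Each ID in a split can then be passed to `ir_datasets.load(...)` as in the examples above.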

`"car/v1.5/trec-y1"`

Official test set of TREC CAR 2017 (year 1).

2.3K queries

Language: en

Query type:

CarQuery: (namedtuple)

query_id: str
text: str
title: str
headings: Tuple[str, ...]

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/trec-y1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, title, headings>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/trec-y1 queries



[query_id]    [text]    [title]    [headings]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/trec-y1')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.car.v1.5.trec-y1.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

30M docs

Inherits docs from car/v1.5

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/trec-y1")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/trec-y1 docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/trec-y1')
# Index car/v1.5
indexer = pt.IterDictIndexer('./indices/car_v1.5', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.car.v1.5.trec-y1')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

\cite{Dietz2017TrecCar,Dietz2017Car}

Bibtex:

{
  "docs": {
    "count": 29678367,
    "fields": {
      "doc_id": {
        "max_len": 40,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 2287
  }
}

`"car/v1.5/trec-y1/auto"`

Official test set of TREC CAR 2017 (year 1), using automatic relevance judgments (assumed from the hierarchical structure of pages, i.e., paragraphs under a header are assumed relevant).

2.3K queries

Inherits queries from car/v1.5/trec-y1

Language: en

Query type:

CarQuery: (namedtuple)

query_id: str
text: str
title: str
headings: Tuple[str, ...]

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/trec-y1/auto")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, title, headings>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/trec-y1/auto queries



[query_id]    [text]    [title]    [headings]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/trec-y1/auto')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.car.v1.5.trec-y1.auto.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

30M docs

Inherits docs from car/v1.5

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/trec-y1/auto")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/trec-y1/auto docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/trec-y1/auto')
# Index car/v1.5
indexer = pt.IterDictIndexer('./indices/car_v1.5', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.car.v1.5.trec-y1.auto')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

5.8K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
1	Paragraph appears under heading	`5.8K`	100.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/trec-y1/auto")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
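Evaluation code often needs the judgments grouped per query rather than as a flat stream. The sketch below shows one way to build a nested `{query_id: {doc_id: relevance}}` mapping from qrel tuples. It uses a local `TrecQrel` stand-in and made-up records so that it runs without downloading the dataset; `group_qrels` is a hypothetical helper name.

```python
from collections import defaultdict, namedtuple

# Local stand-in with the same fields as the documented TrecQrel type.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def group_qrels(qrels):
    """Group a flat iterable of qrels into {query_id: {doc_id: relevance}}."""
    by_query = defaultdict(dict)
    for qrel in qrels:
        by_query[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(by_query)


# Made-up records for illustration.
sample = [
    TrecQrel("q1", "d1", 1, "0"),
    TrecQrel("q1", "d2", 1, "0"),
    TrecQrel("q2", "d3", 1, "0"),
]
grouped = group_qrels(sample)
print(grouped)
```

With the real data, the same function could be applied to the output of `dataset.qrels_iter()`.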

CLI

ir_datasets export car/v1.5/trec-y1/auto qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.
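The exported file can be read back with a few lines of Python. This sketch assumes the export has exactly the four tab-separated columns shown above, in that order, with no header row; `parse_qrels_tsv` is a hypothetical helper and the sample lines are invented.

```python
import io
from collections import namedtuple

# Local stand-in with the same fields as the documented TrecQrel type.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def parse_qrels_tsv(handle):
    """Parse tab-separated qrel lines into TrecQrel tuples, skipping blank lines."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        query_id, doc_id, relevance, iteration = line.split("\t")
        yield TrecQrel(query_id, doc_id, int(relevance), iteration)


# Invented sample standing in for a file written by the export command.
sample_tsv = "q1\td1\t1\t0\nq1\td2\t1\t0\n\n"
parsed = list(parse_qrels_tsv(io.StringIO(sample_tsv)))
print(parsed)
```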

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/trec-y1/auto')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.
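For readers unfamiliar with the `MAP` measure used in the experiment above, the following is a minimal sketch of average precision over binary judgments and its mean across queries. It assumes rankings contain no duplicate documents and that queries without a ranking score zero; evaluation tools may treat such edge cases differently, so this is an illustration rather than a replacement for them. All IDs are made up.

```python
def average_precision(ranking, relevant):
    """AP for one query: mean of the precision at each rank holding a relevant doc."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Divide by all relevant docs, so unretrieved ones lower the score.
    return total / len(relevant)


def mean_average_precision(run, qrels):
    """MAP: the mean of AP over every judged query."""
    scores = [average_precision(run.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores)


# Made-up run and judgments for two queries.
run = {"q1": ["d3", "d1", "d7", "d2"], "q2": ["d9"]}
qrels = {"q1": {"d1", "d2"}, "q2": {"d9"}}

ap_q1 = average_precision(run["q1"], qrels["q1"])
map_score = mean_average_precision(run, qrels)
print(ap_q1, map_score)
```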

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.car.v1.5.trec-y1.auto.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

\cite{Dietz2017TrecCar,Dietz2017Car}

Bibtex:

{
  "docs": {
    "count": 29678367,
    "fields": {
      "doc_id": {
        "max_len": 40,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 2287
  },
  "qrels": {
    "count": 5820,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 5820
        }
      }
    }
  }
}

`"car/v1.5/trec-y1/manual"`

Official test set of TREC CAR 2017 (year 1), using manual graded relevance judgments.

2.3K queries

Inherits queries from car/v1.5/trec-y1

Language: en

Query type:

CarQuery: (namedtuple)

query_id: str
text: str
title: str
headings: Tuple[str, ...]

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/trec-y1/manual")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, title, headings>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/trec-y1/manual queries



[query_id]    [text]    [title]    [headings]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/trec-y1/manual')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.car.v1.5.trec-y1.manual.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

30M docs

Inherits docs from car/v1.5

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/trec-y1/manual")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/trec-y1/manual docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/trec-y1/manual')
# Index car/v1.5
indexer = pt.IterDictIndexer('./indices/car_v1.5', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.car.v1.5.trec-y1.manual')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

30K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
-2	Trash	`42`	0.1%
-1	NO, non-relevant	`13K`	43.2%
0	Non-relevant, but roughly on TOPIC	`9.2K`	31.2%
1	CAN be mentioned	`3.1K`	10.5%
2	SHOULD be mentioned	`2.0K`	6.7%
3	MUST be mentioned	`2.5K`	8.3%
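Binary measures need the graded levels above collapsed to relevant or non-relevant. The sketch below does this with a configurable threshold. Treating level 1 ("CAN be mentioned") and above as relevant is an assumption chosen for illustration, not something this page prescribes; `binarize` is a hypothetical helper and the records are invented.

```python
from collections import namedtuple

# Local stand-in with the same fields as the documented TrecQrel type.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def binarize(qrels, min_rel=1):
    """Map graded relevance to 1 (>= min_rel) or 0 (everything else)."""
    for qrel in qrels:
        yield qrel._replace(relevance=int(qrel.relevance >= min_rel))


# Invented records covering several of the levels in the table.
sample = [
    TrecQrel("q1", "d1", -2, "0"),
    TrecQrel("q1", "d2", 0, "0"),
    TrecQrel("q1", "d3", 1, "0"),
    TrecQrel("q1", "d4", 3, "0"),
]

lenient = [q.relevance for q in binarize(sample)]
strict = [q.relevance for q in binarize(sample, min_rel=2)]
print(lenient, strict)
```

Raising `min_rel` to 2 keeps only the "SHOULD" and "MUST" levels as relevant.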

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v1.5/trec-y1/manual")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export car/v1.5/trec-y1/manual qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:car/v1.5/trec-y1/manual')
index_ref = pt.IndexRef.of('./indices/car_v1.5') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.
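Since this subset has graded judgments, `nDCG@20` rewards ranking higher-graded paragraphs first. The following sketch computes nDCG at a cutoff for a single query. It assumes a linear gain, a log2 rank discount, and that negative and unjudged levels contribute zero gain; the measure implementation used by PyTerrier may differ in these details, so treat this as an illustration only. The judgments and ranking are made up.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranking, judgments, k=20):
    """nDCG@k for one query; negative and unjudged levels count as zero gain."""
    gains = [max(judgments.get(doc_id, 0), 0) for doc_id in ranking[:k]]
    ideal = sorted((max(rel, 0) for rel in judgments.values()), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Made-up graded judgments and a ranking for one query.
judgments = {"d1": 3, "d2": 1, "d3": -1, "d4": 0}
ranking = ["d2", "d3", "d1"]

score = ndcg_at_k(ranking, judgments)
perfect = ndcg_at_k(["d1", "d2"], judgments)
print(score, perfect)
```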

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.car.v1.5.trec-y1.manual.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

\cite{Dietz2017TrecCar,Dietz2017Car}

Bibtex:

{
  "docs": {
    "count": 29678367,
    "fields": {
      "doc_id": {
        "max_len": 40,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 2287
  },
  "qrels": {
    "count": 29571,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "-1": 12785,
          "0": 9219,
          "1": 3094,
          "2": 1970,
          "3": 2461,
          "-2": 42
        }
      }
    }
  }
}
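The percentages in the relevance table for this subset can be re-derived from the `counts_by_value` metadata above. This short sketch copies those counts and recomputes the total, the per-level share, and the number of positively judged pairs (levels 1 to 3).

```python
# Counts copied from the "counts_by_value" metadata of car/v1.5/trec-y1/manual.
counts_by_value = {
    "-1": 12785,
    "0": 9219,
    "1": 3094,
    "2": 1970,
    "3": 2461,
    "-2": 42,
}

total = sum(counts_by_value.values())

# Share of each level, ordered from lowest to highest relevance.
percentages = {
    level: round(100 * count / total, 1)
    for level, count in sorted(counts_by_value.items(), key=lambda kv: int(kv[0]))
}

# Pairs judged at level 1 or above.
positive = sum(count for level, count in counts_by_value.items() if int(level) >= 1)

print(total)
print(percentages)
print(positive)
```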

`"car/v2.0"`

Version 2.0 of the TREC CAR dataset.

docs

30M docs

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("car/v2.0")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export car/v2.0 docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:car/v2.0')
# Index car/v2.0
indexer = pt.IterDictIndexer('./indices/car_v2.0', meta={"docno": 40})
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.car.v2.0')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

\cite{Dietz2017Car}

Bibtex:

@article{Dietz2017Car, title={{TREC CAR}: A Data Set for Complex Answer Retrieval}, author={Laura Dietz and Ben Gamari}, year={2017}, note={Version 1.5}, url={http://trec-car.cs.unh.edu} }