Github: datasets/tripclick.py

ir_datasets: TripClick

Index
  1. tripclick
  2. tripclick/logs
  3. tripclick/test
  4. tripclick/test/head
  5. tripclick/test/tail
  6. tripclick/test/torso
  7. tripclick/train
  8. tripclick/train/head
  9. tripclick/train/head/dctr
  10. tripclick/train/hofstaetter-triples
  11. tripclick/train/tail
  12. tripclick/train/torso
  13. tripclick/val
  14. tripclick/val/head
  15. tripclick/val/head/dctr
  16. tripclick/val/tail
  17. tripclick/val/torso

Data Access Information

To use this dataset, you need a copy of the source files, provided by the Trip Database.

A copy of the source files can be requested through the procedure detailed here. Documents, queries, and qrels require the "TripClick IR Benchmark"; for scoreddocs and docpairs, you will also need to request the "TripClick Training Package for Deep Learning Models". If you want the raw query logs, you will need to request the "Logs Dataset".

The source files you will need are:

ir_datasets expects these files to be copied/linked in ~/.ir_datasets/tripclick/.
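
Before loading the dataset, it can help to confirm that the source files are where ir_datasets expects them. The following is a minimal sketch, assuming the default location named above; the helper names `tripclick_source_dir` and `list_source_files` are hypothetical and not part of ir_datasets.

```python
from pathlib import Path


def tripclick_source_dir(home=None):
    """Return the directory named above: <home>/.ir_datasets/tripclick.

    `home` defaults to the current user's home directory (assumption:
    the default ir_datasets location is in use).
    """
    base = Path(home) if home is not None else Path.home()
    return base / ".ir_datasets" / "tripclick"


def list_source_files(directory):
    """Return the sorted entry names in `directory`, or [] if it does not exist."""
    directory = Path(directory)
    if not directory.is_dir():
        return []
    return sorted(entry.name for entry in directory.iterdir())


if __name__ == "__main__":
    target = tripclick_source_dir()
    print(target)
    print(list_source_files(target))
```

An empty list means the directory is missing or empty, so the source files still need to be copied or linked there.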


"tripclick"

TripClick is a large collection from the Trip Database. Relevance is inferred from click signals.

A copy of this dataset can be obtained from the Trip Database through the process described here. Documents, queries, and qrels require the "TripClick IR Benchmark"; for scoreddocs and docpairs, you will also need to request the "TripClick Training Package for Deep Learning Models".

docs
1.5M docs

Language: en

Document type:
TitleUrlTextDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. url: str
  4. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick docs
[doc_id]    [title]    [url]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])

You can find more details about PyTerrier indexing here.

Citation

ir_datasets.bib:

\cite{Rekabsaz2021TripClick}

Bibtex:

@inproceedings{Rekabsaz2021TripClick,
  title={TripClick: The Log Files of a Large Health Web Search Engine},
  author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff},
  year={2021},
  booktitle={SIGIR}
}

"tripclick/logs"

Raw query logs from TripClick.

Note that this subset includes a broader set of documents than the main collection, but these documents provide only the title and URL (no text).

docs
5.2M docs

Language: en

Document type:
TripClickPartialDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/logs")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/logs docs
[doc_id]    [title]    [url]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/logs')
# Index tripclick/logs
indexer = pt.IterDictIndexer('./indices/tripclick_logs')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url'])

You can find more details about PyTerrier indexing here.

qlogs
5.3M qlogs
Query Log type:
TripClickQlog: (namedtuple)
  1. session_id: str
  2. query_id: str
  3. query: str
  4. query_orig: str
  5. time: datetime
  6. items: Tuple[
    LogItem: (namedtuple)
    1. doc_id: str
    2. clicked: bool
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/logs")
for qlog in dataset.qlogs_iter():
    qlog # namedtuple<session_id, query_id, query, query_orig, time, items>

You can find more details about the Python API here.
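
Each qlog record carries a tuple of items with a `clicked` flag, so simple click statistics can be computed in one pass. The sketch below is illustrative only: the helper `click_through_rate` and the stand-in records are hypothetical, and the field names are taken from the TripClickQlog type listed above.

```python
from collections import namedtuple


def click_through_rate(qlogs):
    """Return the fraction of displayed items that were clicked."""
    shown = 0
    clicked = 0
    for qlog in qlogs:
        for item in qlog.items:
            shown += 1
            clicked += int(item.clicked)
    return clicked / shown if shown else 0.0


# Stand-in records with the same field names as the types listed above.
LogItem = namedtuple("LogItem", ["doc_id", "clicked"])
Qlog = namedtuple(
    "Qlog", ["session_id", "query_id", "query", "query_orig", "time", "items"]
)
sample = [
    Qlog("s1", "q1", "asthma", "Asthma", None,
         (LogItem("d1", True), LogItem("d2", False))),
    Qlog("s1", "q2", "copd", "COPD", None,
         (LogItem("d3", False), LogItem("d4", False))),
]
print(click_through_rate(sample))  # 1 of 4 items clicked -> 0.25
```

With the real data, the same function could be called as `click_through_rate(dataset.qlogs_iter())`; expect this to take a while, since the subset holds 5.3M log entries.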

CLI

No example available for CLI

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Rekabsaz2021TripClick}

Bibtex:

@inproceedings{Rekabsaz2021TripClick,
  title={TripClick: The Log Files of a Large Health Web Search Engine},
  author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff},
  year={2021},
  booktitle={SIGIR}
}

"tripclick/test"

Test subset of tripclick, including all queries from tripclick/test/head, tripclick/test/torso, and tripclick/test/tail.

The scoreddocs are the official BM25 results from Anserini.

queries
3.5K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/test queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/test')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
1.5M docs

Inherits docs from tripclick

Language: en

Document type:
TitleUrlTextDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. url: str
  4. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/test docs
[doc_id]    [title]    [url]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/test')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])

You can find more details about PyTerrier indexing here.

scoreddocs
3.5M scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.
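
Scoreddocs arrive as a flat stream of (query_id, doc_id, score) records, so a common first step is to group them into a ranked list per query. The following is a sketch under that assumption; `top_k_per_query` is a hypothetical helper, and the sample records stand in for the real BM25 results.

```python
import heapq
from collections import defaultdict, namedtuple


def top_k_per_query(scoreddocs, k=10):
    """Group scored documents by query and keep the k highest-scoring doc IDs."""
    by_query = defaultdict(list)
    for sd in scoreddocs:
        by_query[sd.query_id].append((sd.score, sd.doc_id))
    return {
        query_id: [doc_id for _score, doc_id in heapq.nlargest(k, pairs)]
        for query_id, pairs in by_query.items()
    }


# Stand-in records with the same fields as GenericScoredDoc.
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])
sample = [
    ScoredDoc("q1", "d1", 1.5),
    ScoredDoc("q1", "d2", 3.0),
    ScoredDoc("q1", "d3", 2.0),
    ScoredDoc("q2", "d9", 0.4),
]
print(top_k_per_query(sample, k=2))  # {'q1': ['d2', 'd3'], 'q2': ['d9']}
```

This version holds every record in memory before ranking, which is simple but may be heavy for the full 3.5M scoreddocs; a streaming variant would be preferable if memory is tight.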

CLI
ir_datasets export tripclick/test scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Rekabsaz2021TripClick}

Bibtex:

@inproceedings{Rekabsaz2021TripClick,
  title={TripClick: The Log Files of a Large Health Web Search Engine},
  author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff},
  year={2021},
  booktitle={SIGIR}
}

"tripclick/test/head"

The most frequent queries in the test set. This represents 20% of the search engine traffic.

queries
1.2K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/test/head")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/test/head queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/test/head')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
1.5M docs

Inherits docs from tripclick

Language: en

Document type:
TitleUrlTextDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. url: str
  4. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/test/head")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/test/head docs
[doc_id]    [title]    [url]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/test/head')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])

You can find more details about PyTerrier indexing here.

scoreddocs
1.2M scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/test/head")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/test/head scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Rekabsaz2021TripClick}

Bibtex:

@inproceedings{Rekabsaz2021TripClick,
  title={TripClick: The Log Files of a Large Health Web Search Engine},
  author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff},
  year={2021},
  booktitle={SIGIR}
}

"tripclick/test/tail"

The least frequent queries in the test set. This represents 50% of the search engine traffic.

queries
1.2K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/test/tail")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/test/tail queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/test/tail')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
1.5M docs

Inherits docs from tripclick

Language: en

Document type:
TitleUrlTextDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. url: str
  4. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/test/tail")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/test/tail docs
[doc_id]    [title]    [url]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/test/tail')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])

You can find more details about PyTerrier indexing here.

scoreddocs
1.2M scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/test/tail")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/test/tail scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Rekabsaz2021TripClick}

Bibtex:

@inproceedings{Rekabsaz2021TripClick,
  title={TripClick: The Log Files of a Large Health Web Search Engine},
  author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff},
  year={2021},
  booktitle={SIGIR}
}

"tripclick/test/torso"

The moderately frequent queries in the test set. This represents 30% of the search engine traffic.

queries
1.2K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/test/torso")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/test/torso queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/test/torso')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
1.5M docs

Inherits docs from tripclick

Language: en

Document type:
TitleUrlTextDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. url: str
  4. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/test/torso")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/test/torso docs
[doc_id]    [title]    [url]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/test/torso')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])

You can find more details about PyTerrier indexing here.

scoreddocs
1.2M scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/test/torso")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/test/torso scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Rekabsaz2021TripClick}

Bibtex:

@inproceedings{Rekabsaz2021TripClick,
  title={TripClick: The Log Files of a Large Health Web Search Engine},
  author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff},
  year={2021},
  booktitle={SIGIR}
}

"tripclick/train"

Training subset of tripclick, including all queries from tripclick/train/head, tripclick/train/torso, and tripclick/train/tail.

The dataset provides docpairs in a full text format; we map this text back to the query and doc IDs. A small number of docpairs could not be mapped back, so they are skipped.

queries
686K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/train queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
1.5M docs

Inherits docs from tripclick

Language: en

Document type:
TitleUrlTextDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. url: str
  4. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/train docs
[doc_id]    [title]    [url]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])

You can find more details about PyTerrier indexing here.

qrels
2.7M qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition                                           Count    %
0       not clicked and appeared higher in search results    1.5M     54.2%
1       clicked                                              1.2M     45.8%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
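
The counts and percentages in the relevance-level table can be recomputed directly from the qrels. The sketch below shows one way to do that; `relevance_distribution` is a hypothetical helper, and the sample records only mimic the TrecQrel fields listed above.

```python
from collections import Counter, namedtuple


def relevance_distribution(qrels):
    """Map each relevance level to (count, percentage of all qrels)."""
    counts = Counter(qrel.relevance for qrel in qrels)
    total = sum(counts.values())
    return {
        level: (count, 100.0 * count / total)
        for level, count in sorted(counts.items())
    }


# Stand-in records with the same fields as TrecQrel.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])
sample = [
    Qrel("q1", "d1", 0, "0"),
    Qrel("q1", "d2", 1, "0"),
    Qrel("q2", "d3", 0, "0"),
    Qrel("q2", "d4", 0, "0"),
]
print(relevance_distribution(sample))  # {0: (3, 75.0), 1: (1, 25.0)}
```

Passing `dataset.qrels_iter()` instead of `sample` should give figures comparable to the table, up to rounding.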

CLI
ir_datasets export tripclick/train qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:tripclick/train')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

docpairs
23M docpairs
Document Pair type:
GenericDocPair: (namedtuple)
  1. query_id: str
  2. doc_id_a: str
  3. doc_id_b: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>

You can find more details about the Python API here.
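
Docpairs contain only IDs, so training code usually needs to join them back to the query and document text. The following is a minimal sketch of that join using plain dictionaries; `text_triples` is a hypothetical helper, and treating unresolvable IDs as skippable is an assumption made for illustration.

```python
from collections import namedtuple


def text_triples(docpairs, query_text, doc_text):
    """Yield (query text, text of doc_id_a, text of doc_id_b) for each docpair.

    `query_text` and `doc_text` map IDs to text. Pairs that reference an
    unknown ID are skipped (an assumption of this sketch).
    """
    for pair in docpairs:
        try:
            yield (
                query_text[pair.query_id],
                doc_text[pair.doc_id_a],
                doc_text[pair.doc_id_b],
            )
        except KeyError:
            continue


# Stand-in data with the same fields as GenericDocPair.
DocPair = namedtuple("DocPair", ["query_id", "doc_id_a", "doc_id_b"])
queries = {"q1": "asthma treatment"}
docs = {"d1": "Inhaled corticosteroids ...", "d2": "Knee surgery ..."}
pairs = [DocPair("q1", "d1", "d2"), DocPair("q1", "d1", "missing")]
print(list(text_triples(pairs, queries, docs)))
```

With the real dataset, the query mapping could be built from `dataset.queries_iter()`, and document text could be fetched by ID through `dataset.docs_store().get(doc_id)` rather than a dictionary, since the collection holds 1.5M documents.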

CLI
ir_datasets export tripclick/train docpairs
[query_id]    [doc_id_a]    [doc_id_b]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Rekabsaz2021TripClick}

Bibtex:

@inproceedings{Rekabsaz2021TripClick,
  title={TripClick: The Log Files of a Large Health Web Search Engine},
  author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff},
  year={2021},
  booktitle={SIGIR}
}

"tripclick/train/head"

The most frequent queries in the train set. This represents 20% of the search engine traffic.

queries
3.5K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train/head")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/train/head queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/head')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
1.5M docs

Inherits docs from tripclick

Language: en

Document type:
TitleUrlTextDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. url: str
  4. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train/head")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/train/head docs
[doc_id]    [title]    [url]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/head')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])

You can find more details about PyTerrier indexing here.

qrels
117K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition                                           Count    %
0       not clicked and appeared higher in search results    61K      52.4%
1       clicked                                              56K      47.6%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train/head")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/train/head qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/head')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Rekabsaz2021TripClick}

Bibtex:

@inproceedings{Rekabsaz2021TripClick,
  title={TripClick: The Log Files of a Large Health Web Search Engine},
  author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff},
  year={2021},
  booktitle={SIGIR}
}

"tripclick/train/head/dctr"

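The most frequent queries in the train set. This represents 20% of the search engine traffic. Unlike tripclick/train/head, the qrels in this subset are graded on four levels according to how often a document was clicked when it was shown.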

queries
3.5K queries

Inherits queries from tripclick/train/head

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train/head/dctr")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/train/head/dctr queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/head/dctr')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
1.5M docs

Inherits docs from tripclick

Language: en

Document type:
TitleUrlTextDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. url: str
  4. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train/head/dctr")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/train/head/dctr docs
[doc_id]    [title]    [url]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/head/dctr')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])

You can find more details about PyTerrier indexing here.

qrels
128K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition                                                                            Count    %
0       not relevant; never clicked                                                           93K      72.2%
1       partially relevant; clicked less than 4% of times it was shown (but at least once)    24K      18.8%
2       relevant; clicked more than 4% but less than 30% of times it was shown                7.8K     6.1%
3       highly relevant; clicked more than 30% of the times it was shown                      3.7K     2.9%
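
The thresholds in the table above can be written as a small function that maps click counts to a relevance level. This is only a sketch of the mapping as the table describes it, not the code used to build the qrels; `dctr_relevance` is a hypothetical name, and the handling of rates exactly equal to 4% or 30% is an assumption, because the table does not say which side those values fall on.

```python
def dctr_relevance(clicks, impressions):
    """Map a document's click statistics for a query to a relevance level (0-3).

    Thresholds follow the table above. Assumption: a rate of exactly 4%
    falls in level 1 and a rate of exactly 30% falls in level 2.
    """
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    if clicks == 0:
        return 0  # not relevant; never clicked
    rate = clicks / impressions
    if rate > 0.30:
        return 3  # highly relevant
    if rate > 0.04:
        return 2  # relevant
    return 1  # partially relevant; clicked at least once


if __name__ == "__main__":
    for clicks, impressions in [(0, 100), (1, 100), (10, 100), (50, 100)]:
        print(clicks, impressions, dctr_relevance(clicks, impressions))
```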

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train/head/dctr")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/train/head/dctr qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/head/dctr')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Rekabsaz2021TripClick}

Bibtex:

@inproceedings{Rekabsaz2021TripClick,
  title={TripClick: The Log Files of a Large Health Web Search Engine},
  author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff},
  year={2021},
  booktitle={SIGIR}
}

"tripclick/train/hofstaetter-triples"

A version of tripclick/train that replaces the original (noisy) training triples (docpairs) with triples sampled from BM25 results, as suggested by Hofstätter et al. (2022).

queries
686K queries

Inherits queries from tripclick/train

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train/hofstaetter-triples")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/train/hofstaetter-triples queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/hofstaetter-triples')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
1.5M docs

Inherits docs from tripclick

Language: en

Document type:
TitleUrlTextDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. url: str
  4. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train/hofstaetter-triples")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/train/hofstaetter-triples docs
[doc_id]    [title]    [url]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/hofstaetter-triples')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])

You can find more details about PyTerrier indexing here.

qrels
2.7M qrels

Inherits qrels from tripclick/train

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition                                           Count    %
0       not clicked and appeared higher in search results    1.5M     54.2%
1       clicked                                              1.2M     45.8%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train/hofstaetter-triples")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/train/hofstaetter-triples qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/hofstaetter-triples')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

docpairs
10M docpairs
Document Pair type:
GenericDocPair: (namedtuple)
  1. query_id: str
  2. doc_id_a: str
  3. doc_id_b: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train/hofstaetter-triples")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/train/hofstaetter-triples docpairs
[query_id]    [doc_id_a]    [doc_id_b]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Rekabsaz2021TripClick,Hofstaetter2022TripClick}

Bibtex:

@inproceedings{Rekabsaz2021TripClick,
  title={TripClick: The Log Files of a Large Health Web Search Engine},
  author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff},
  year={2021},
  booktitle={SIGIR}
}
@inproceedings{Hofstaetter2022TripClick,
  title={Establishing Strong Baselines for TripClick Health Retrieval},
  author={Sebastian Hofst\"atter and Sophia Althammer and Mete Sertkan and Allan Hanbury},
  year={2022},
  booktitle={ECIR}
}

"tripclick/train/tail"

The least frequent queries in the train set. This represents 50% of the search engine traffic.

queries
576K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train/tail")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/train/tail queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/tail')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
1.5M docs

Inherits docs from tripclick

Language: en

Document type:
TitleUrlTextDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. url: str
  4. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train/tail")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/train/tail docs
[doc_id]    [title]    [url]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/tail')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])

You can find more details about PyTerrier indexing here.

qrels
1.6M qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition                                           Count    %
0       not clicked and appeared higher in search results    863K     53.2%
1       clicked                                              759K     46.8%
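
The Count and % columns above can be recomputed from any qrels iterator. The following is a minimal sketch, not part of ir_datasets: `relevance_distribution` is a hypothetical helper, `TrecQrel` is redefined locally as a stand-in for the library's namedtuple, and the sample judgments are made up so that the snippet runs without the TripClick source files. With the real data, pass `dataset.qrels_iter()` instead of `sample_qrels`.

```python
from collections import Counter, namedtuple

# Local stand-in for ir_datasets' TrecQrel (assumption: same field names).
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def relevance_distribution(qrels):
    """Return {relevance: (count, percent)} for an iterable of qrels."""
    counts = Counter(qrel.relevance for qrel in qrels)
    total = sum(counts.values())
    return {
        rel: (count, round(100.0 * count / total, 1))
        for rel, count in sorted(counts.items())
    }


# Hypothetical judgments for illustration only, not real TripClick data.
sample_qrels = [
    TrecQrel("q1", "d1", 1, "0"),
    TrecQrel("q1", "d2", 0, "0"),
    TrecQrel("q2", "d3", 0, "0"),
    TrecQrel("q2", "d4", 1, "0"),
    TrecQrel("q2", "d5", 0, "0"),
]

distribution = relevance_distribution(sample_qrels)
for rel, (count, percent) in distribution.items():
    print(rel, count, f"{percent}%")
```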

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train/tail")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/train/tail qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/tail')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Rekabsaz2021TripClick}

Bibtex:

@inproceedings{Rekabsaz2021TripClick,
  title={TripClick: The Log Files of a Large Health Web Search Engine},
  author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff},
  year={2021},
  booktitle={SIGIR}
}

"tripclick/train/torso"

The moderately frequent queries in the train set, which together account for 30% of the search engine traffic.

queries
106K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train/torso")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/train/torso queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/torso')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
1.5M docs

Inherits docs from tripclick

Language: en

Document type:
TitleUrlTextDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. url: str
  4. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train/torso")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/train/torso docs
[doc_id]    [title]    [url]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/torso')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])

You can find more details about PyTerrier indexing here.

qrels
967K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition                                           Count    %
0       not clicked and appeared higher in search results    542K     56.1%
1       clicked                                              425K     43.9%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train/torso")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/train/torso qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/torso')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Rekabsaz2021TripClick}

Bibtex:

@inproceedings{Rekabsaz2021TripClick,
  title={TripClick: The Log Files of a Large Health Web Search Engine},
  author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff},
  year={2021},
  booktitle={SIGIR}
}

"tripclick/val"

Validation subset of tripclick, including all queries from tripclick/val/head, tripclick/val/torso, and tripclick/val/tail.

The scoreddocs are the official BM25 results from Anserini.

queries
3.5K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/val queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/val')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
1.5M docs

Inherits docs from tripclick

Language: en

Document type:
TitleUrlTextDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. url: str
  4. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/val docs
[doc_id]    [title]    [url]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/val')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])

You can find more details about PyTerrier indexing here.

qrels
82K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition                                           Count    %
0       not clicked and appeared higher in search results    42K      51.4%
1       clicked                                              40K      48.6%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/val qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:tripclick/val')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

scoreddocs
3.5M scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.
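
The scoreddocs iterator yields one (query_id, doc_id, score) record at a time, so a common first step is to group the records into a ranked candidate list per query, for example as input to a re-ranker. The following is a minimal sketch under stated assumptions: `top_k_per_query` is a hypothetical helper, `GenericScoredDoc` is redefined locally as a stand-in for the library's namedtuple, and the sample scores are made up so that the snippet runs without the TripClick source files. With the real data, pass `dataset.scoreddocs_iter()` instead of `sample_scoreddocs`.

```python
from collections import defaultdict, namedtuple

# Local stand-in for ir_datasets' GenericScoredDoc (assumption: same field names).
GenericScoredDoc = namedtuple("GenericScoredDoc", ["query_id", "doc_id", "score"])


def top_k_per_query(scoreddocs, k=10):
    """Group scored docs by query and keep the k highest-scoring doc_ids."""
    by_query = defaultdict(list)
    for sd in scoreddocs:
        by_query[sd.query_id].append((sd.doc_id, sd.score))
    return {
        query_id: [
            doc_id
            for doc_id, _ in sorted(docs, key=lambda pair: pair[1], reverse=True)[:k]
        ]
        for query_id, docs in by_query.items()
    }


# Hypothetical scores for illustration only, not real BM25 output.
sample_scoreddocs = [
    GenericScoredDoc("q1", "d1", 7.5),
    GenericScoredDoc("q1", "d2", 9.1),
    GenericScoredDoc("q1", "d3", 3.2),
    GenericScoredDoc("q2", "d4", 1.0),
]

candidates = top_k_per_query(sample_scoreddocs, k=2)
print(candidates)
```

Note that this keeps every record in memory before sorting, which may be costly for the full 3.5M scoreddocs.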

CLI
ir_datasets export tripclick/val scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Rekabsaz2021TripClick}

Bibtex:

@inproceedings{Rekabsaz2021TripClick,
  title={TripClick: The Log Files of a Large Health Web Search Engine},
  author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff},
  year={2021},
  booktitle={SIGIR}
}

"tripclick/val/head"

The most frequent queries in the validation set, which together account for 20% of the search engine traffic.

queries
1.2K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/val/head queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/head')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
1.5M docs

Inherits docs from tripclick

Language: en

Document type:
TitleUrlTextDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. url: str
  4. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/val/head docs
[doc_id]    [title]    [url]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/head')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])

You can find more details about PyTerrier indexing here.

qrels
64K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition                                           Count    %
0       not clicked and appeared higher in search results    32K      50.2%
1       clicked                                              32K      49.8%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/val/head qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/head')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

scoreddocs
1.2M scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.
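
Because this subset provides both qrels and scoreddocs, the two can be joined to check how many clicked documents appear among the BM25 candidates. The following is a minimal sketch under stated assumptions: `candidate_recall` is a hypothetical helper (not an ir_datasets function), the two namedtuples are local stand-ins for the library's types, and the sample records are made up so that the snippet runs without the TripClick source files. With the real data, pass `dataset.qrels_iter()` and `dataset.scoreddocs_iter()`.

```python
from collections import defaultdict, namedtuple

# Local stand-ins for the ir_datasets namedtuples (assumption: same field names).
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])
GenericScoredDoc = namedtuple("GenericScoredDoc", ["query_id", "doc_id", "score"])


def candidate_recall(qrels, scoreddocs, min_rel=1):
    """Mean per-query fraction of relevant docs that appear in the scoreddocs."""
    relevant = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_rel:
            relevant[qrel.query_id].add(qrel.doc_id)

    retrieved = defaultdict(set)
    for sd in scoreddocs:
        retrieved[sd.query_id].add(sd.doc_id)

    # Queries without any relevant document are skipped.
    recalls = [
        len(docs & retrieved[query_id]) / len(docs)
        for query_id, docs in relevant.items()
    ]
    return sum(recalls) / len(recalls) if recalls else 0.0


# Hypothetical records for illustration only, not real TripClick data.
sample_qrels = [
    TrecQrel("q1", "d1", 1, "0"),
    TrecQrel("q1", "d2", 1, "0"),
    TrecQrel("q1", "d9", 0, "0"),
    TrecQrel("q2", "d4", 1, "0"),
]
sample_scoreddocs = [
    GenericScoredDoc("q1", "d1", 8.0),
    GenericScoredDoc("q1", "d3", 6.5),
    GenericScoredDoc("q2", "d4", 4.2),
]

recall = candidate_recall(sample_qrels, sample_scoreddocs)
print(recall)
```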

CLI
ir_datasets export tripclick/val/head scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Rekabsaz2021TripClick}

Bibtex:

@inproceedings{Rekabsaz2021TripClick,
  title={TripClick: The Log Files of a Large Health Web Search Engine},
  author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff},
  year={2021},
  booktitle={SIGIR}
}

"tripclick/val/head/dctr"

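The most frequent queries in the validation set, which together account for 20% of the search engine traffic. Queries and scoreddocs are inherited from tripclick/val/head; the qrels use graded relevance levels (0 to 3) based on how often a document was clicked when it was shown.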

queries
1.2K queries

Inherits queries from tripclick/val/head

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head/dctr")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/val/head/dctr queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/head/dctr')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
1.5M docs

Inherits docs from tripclick

Language: en

Document type:
TitleUrlTextDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. url: str
  4. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head/dctr")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/val/head/dctr docs
[doc_id]    [title]    [url]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/head/dctr')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])

You can find more details about PyTerrier indexing here.

qrels
67K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition                                                                                   Count    %
0       not relevant; never clicked                                                                  47K      70.3%
1       partially relevant; clicked less than 4% of the times it was shown (but at least once)      14K      20.9%
2       relevant; clicked more than 4% but less than 30% of the times it was shown                   4.0K     5.9%
3       highly relevant; clicked more than 30% of the times it was shown                             2.0K     2.9%
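
The thresholds in the table above can be read as a mapping from a document's click-through rate to a graded label. The following is a minimal sketch of that mapping, not the code used to build the dataset: `dctr_relevance` and its `clicks` and `impressions` parameters are hypothetical names, and because the table does not say which level applies at exactly 4% or exactly 30%, the sketch assigns both boundary values to level 2 as an assumption.

```python
def dctr_relevance(clicks, impressions):
    """Map click counts to the graded levels 0 to 3 listed in the table.

    Assumption: a click-through rate of exactly 4% or exactly 30% is
    assigned to level 2, since the table leaves both boundaries open.
    """
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    if clicks == 0:
        return 0  # not relevant; never clicked
    ctr = clicks / impressions
    if ctr < 0.04:
        return 1  # partially relevant; clicked at least once, below 4%
    if ctr <= 0.30:
        return 2  # relevant; between 4% and 30%
    return 3  # highly relevant; above 30%


# Hypothetical click counts for illustration only.
levels = [
    dctr_relevance(0, 50),
    dctr_relevance(1, 100),
    dctr_relevance(10, 100),
    dctr_relevance(50, 100),
]
print(levels)
```

Since these labels are graded rather than binary, a graded measure such as nDCG can make use of all four levels.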

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head/dctr")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/val/head/dctr qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/head/dctr')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

scoreddocs
1.2M scoreddocs

Inherits scoreddocs from tripclick/val/head

Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head/dctr")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/val/head/dctr scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Rekabsaz2021TripClick}

Bibtex:

@inproceedings{Rekabsaz2021TripClick,
  title={TripClick: The Log Files of a Large Health Web Search Engine},
  author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff},
  year={2021},
  booktitle={SIGIR}
}

"tripclick/val/tail"

The least frequent queries in the validation set, which together account for 50% of the search engine traffic.

queries
1.2K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val/tail")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/val/tail queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/tail')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
1.5M docs

Inherits docs from tripclick

Language: en

Document type:
TitleUrlTextDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. url: str
  4. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val/tail")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/val/tail docs
[doc_id]    [title]    [url]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/tail')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])

You can find more details about PyTerrier indexing here.

qrels
3.9K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition                                           Count    %
0       not clicked and appeared higher in search results    2.1K     53.6%
1       clicked                                              1.8K     46.4%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val/tail")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/val/tail qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/tail')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

scoreddocs
1.2M scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val/tail")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/val/tail scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Rekabsaz2021TripClick}

Bibtex:

@inproceedings{Rekabsaz2021TripClick,
  title={TripClick: The Log Files of a Large Health Web Search Engine},
  author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff},
  year={2021},
  booktitle={SIGIR}
}

"tripclick/val/torso"

The moderately frequent queries in the validation set, which together account for 30% of the search engine traffic.

queries
1.2K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val/torso")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/val/torso queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/torso')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs
1.5M docs

Inherits docs from tripclick

Language: en

Document type:
TitleUrlTextDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. url: str
  4. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val/torso")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/val/torso docs
[doc_id]    [title]    [url]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/torso')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])

You can find more details about PyTerrier indexing here.

qrels
14K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition                                           Count    %
0       not clicked and appeared higher in search results    7.9K     56.1%
1       clicked                                              6.2K     43.9%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val/torso")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/val/torso qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/torso')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

scoreddocs
1.2M scoreddocs
Scored Document type:
GenericScoredDoc: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. score: float

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val/torso")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI
ir_datasets export tripclick/val/torso scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Rekabsaz2021TripClick}

Bibtex:

@inproceedings{Rekabsaz2021TripClick,
  title={TripClick: The Log Files of a Large Health Web Search Engine},
  author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff},
  year={2021},
  booktitle={SIGIR}
}