ir_datasets: TripClick
To use this dataset, you need a copy of the source files, provided by the Trip Database.
A copy of the source files can be requested through the procedure detailed here. Documents, queries, and qrels require the "TripClick IR Benchmark"; for scoreddocs and docpairs, you will also need to request the "TripClick Training Package for Deep Learning Models". If you want the raw query logs, you will need to request the "Logs Dataset".
The source files you will need are:
ir_datasets expects these files to be copied/linked in ~/.ir_datasets/tripclick/.
TripClick is a large collection from the Trip Database. Relevance is inferred from click signals.
A copy of this dataset can be obtained from the Trip Database through the process described here. Documents, queries, and qrels require the "TripClick IR Benchmark"; for scoreddocs and docpairs, you will also need to request the "TripClick Training Package for Deep Learning Models".
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick docs
[doc_id] [title] [url] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Bibtex:
@inproceedings{Rekabsaz2021TripClick, title={TripClick: The Log Files of a Large Health Web Search Engine}, author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff}, year={2021}, booktitle={SIGIR} }
Metadata:
{ "docs": { "count": 1523878, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } } }
Raw query logs from TripClick.
Note that this subset includes a broader set of documents than the main collection, but these documents provide only a title and URL.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/logs")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url>
You can find more details about the Python API here.
ir_datasets export tripclick/logs docs
[doc_id] [title] [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/logs')
# Index tripclick/logs
indexer = pt.IterDictIndexer('./indices/tripclick_logs')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url'])
You can find more details about PyTerrier indexing here.
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/logs")
for qlog in dataset.qlogs_iter():
    qlog # namedtuple<session_id, query_id, query, query_orig, time, items>
You can find more details about the Python API here.
No example available for CLI
No example available for PyTerrier
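As a small illustration of working with the query log records, the sketch below groups queries by session. It uses a stand-in namedtuple with the field names listed above; the sample values are invented, and the helper name `queries_by_session` is hypothetical. With the real data, the records would come from `dataset.qlogs_iter()`.

```python
from collections import defaultdict, namedtuple

# Stand-in record type with the same field names as the qlog records above.
QLog = namedtuple(
    "QLog", ["session_id", "query_id", "query", "query_orig", "time", "items"]
)


def queries_by_session(qlogs):
    """Map each session_id to the list of its query strings, in input order."""
    sessions = defaultdict(list)
    for qlog in qlogs:
        sessions[qlog.session_id].append(qlog.query)
    return dict(sessions)


# Invented sample records; time and items are left empty because their
# exact types are not described here.
sample = [
    QLog("s1", "q1", "asthma", "Asthma", None, ()),
    QLog("s2", "q2", "migraine", "migraine", None, ()),
    QLog("s1", "q3", "asthma children", "asthma children", None, ()),
]
grouped = queries_by_session(sample)
```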
Bibtex:
@inproceedings{Rekabsaz2021TripClick, title={TripClick: The Log Files of a Large Health Web Search Engine}, author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff}, year={2021}, booktitle={SIGIR} }
Metadata:
{ "docs": { "count": 5196956, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "qlogs": { "count": 5317350 } }
Test subset of tripclick, including all queries from tripclick/test/head, tripclick/test/torso, and tripclick/test/tail.
The scoreddocs are the official BM25 results from Anserini.
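A common use of such first-stage results is to take the top-scoring candidates per query for re-ranking. The sketch below assumes records with `query_id`, `doc_id`, and `score` fields (as in the scoreddocs examples for this subset); the helper name `top_k_per_query` and the sample values are hypothetical.

```python
import heapq
from collections import defaultdict, namedtuple

# Stand-in record type; real records would come from dataset.scoreddocs_iter().
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])


def top_k_per_query(scoreddocs, k=10):
    """Return {query_id: [doc_id, ...]} with the k highest-scoring docs per query."""
    by_query = defaultdict(list)
    for sd in scoreddocs:
        by_query[sd.query_id].append(sd)
    return {
        qid: [sd.doc_id for sd in heapq.nlargest(k, docs, key=lambda sd: sd.score)]
        for qid, docs in by_query.items()
    }


# Invented sample scores for illustration only.
sample = [
    ScoredDoc("q1", "d1", 3.5),
    ScoredDoc("q1", "d2", 7.25),
    ScoredDoc("q1", "d3", 1.0),
    ScoredDoc("q2", "d9", 2.0),
]
top2 = top_k_per_query(sample, k=2)
```

Note that this holds all records in memory, which may be impractical for the full set of scoreddocs.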
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/test queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/test')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from tripclick
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/test docs
[doc_id] [title] [url] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/test')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export tripclick/test scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Rekabsaz2021TripClick, title={TripClick: The Log Files of a Large Health Web Search Engine}, author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff}, year={2021}, booktitle={SIGIR} }
Metadata:
{ "docs": { "count": 1523878, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 3525 }, "scoreddocs": { "count": 3486402 } }
The most frequent queries in the test set. This represents 20% of the search engine traffic.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/test/head")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/test/head queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/test/head')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from tripclick
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/test/head")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/test/head docs
[doc_id] [title] [url] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/test/head')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/test/head")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export tripclick/test/head scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Rekabsaz2021TripClick, title={TripClick: The Log Files of a Large Health Web Search Engine}, author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff}, year={2021}, booktitle={SIGIR} }
Metadata:
{ "docs": { "count": 1523878, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 1175 }, "scoreddocs": { "count": 1159303 } }
The least frequent queries in the test set. This represents 50% of the search engine traffic.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/test/tail")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/test/tail queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/test/tail')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from tripclick
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/test/tail")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/test/tail docs
[doc_id] [title] [url] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/test/tail')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/test/tail")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export tripclick/test/tail scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Rekabsaz2021TripClick, title={TripClick: The Log Files of a Large Health Web Search Engine}, author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff}, year={2021}, booktitle={SIGIR} }
Metadata:
{ "docs": { "count": 1523878, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 1175 }, "scoreddocs": { "count": 1165127 } }
The moderately frequent queries in the test set. This represents 30% of the search engine traffic.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/test/torso")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/test/torso queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/test/torso')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from tripclick
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/test/torso")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/test/torso docs
[doc_id] [title] [url] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/test/torso')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/test/torso")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export tripclick/test/torso scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Rekabsaz2021TripClick, title={TripClick: The Log Files of a Large Health Web Search Engine}, author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff}, year={2021}, booktitle={SIGIR} }
Metadata:
{ "docs": { "count": 1523878, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 1175 }, "scoreddocs": { "count": 1161972 } }
Training subset of tripclick, including all queries from tripclick/train/head, tripclick/train/torso, and tripclick/train/tail.
The dataset provides docpairs in a full text format; we map this text back to the query and doc IDs. A small number of docpairs could not be mapped back, so they are skipped.
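The mapping step described above can be pictured roughly as follows. This is only a sketch under the assumption of exact-text lookups; it is not the code ir_datasets actually uses, and the function name, argument names, and sample data are all hypothetical.

```python
def map_text_triples(text_triples, query_id_by_text, doc_id_by_text):
    """Convert (query_text, doc_a_text, doc_b_text) triples to ID triples.

    Triples with any text that cannot be found in the lookup tables are
    skipped. Returns (mapped_triples, number_skipped).
    """
    mapped = []
    skipped = 0
    for query_text, doc_a_text, doc_b_text in text_triples:
        query_id = query_id_by_text.get(query_text)
        doc_id_a = doc_id_by_text.get(doc_a_text)
        doc_id_b = doc_id_by_text.get(doc_b_text)
        if query_id is None or doc_id_a is None or doc_id_b is None:
            skipped += 1  # could not be mapped back, so it is dropped
            continue
        mapped.append((query_id, doc_id_a, doc_id_b))
    return mapped, skipped


# Invented toy data for illustration only.
queries = {"asthma treatment": "q1"}
docs = {"Inhaled steroids ...": "d1", "Knee surgery ...": "d2"}
triples = [
    ("asthma treatment", "Inhaled steroids ...", "Knee surgery ..."),
    ("asthma treatment", "Inhaled steroids ...", "text missing from the corpus"),
]
mapped, skipped = map_text_triples(triples, queries, docs)
```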
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/train queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from tripclick
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/train docs
[doc_id] [title] [url] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not clicked and appeared higher in search results | 1.5M | 54.2% |
1 | clicked | 1.2M | 45.8% |
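The percentages in this table follow from the per-level qrel counts reported in the metadata for tripclick/train (1,466,051 qrels at level 0 and 1,239,161 at level 1). A short check, with a hypothetical helper name:

```python
def relevance_percentages(counts_by_value):
    """Return each relevance level's share of all qrels, as a percentage."""
    total = sum(counts_by_value.values())
    return {
        level: round(100 * count / total, 1)
        for level, count in sorted(counts_by_value.items())
    }


# Counts copied from the tripclick/train metadata.
train_counts = {"1": 1239161, "0": 1466051}
shares = relevance_percentages(train_counts)  # {"0": 54.2, "1": 45.8}
```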
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export tripclick/train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:tripclick/train')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/train")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export tripclick/train docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Rekabsaz2021TripClick, title={TripClick: The Log Files of a Large Health Web Search Engine}, author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff}, year={2021}, booktitle={SIGIR} }
Metadata:
{ "docs": { "count": 1523878, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 685649 }, "qrels": { "count": 2705212, "fields": { "relevance": { "counts_by_value": { "1": 1239161, "0": 1466051 } } } }, "docpairs": { "count": 23221224 } }
The most frequent queries in the train set. This represents 20% of the search engine traffic.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/train/head")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/train/head queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/head')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from tripclick
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/train/head")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/train/head docs
[doc_id] [title] [url] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/head')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not clicked and appeared higher in search results | 61K | 52.4% |
1 | clicked | 56K | 47.6% |
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/train/head")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export tripclick/train/head qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/head')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Rekabsaz2021TripClick, title={TripClick: The Log Files of a Large Health Web Search Engine}, author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff}, year={2021}, booktitle={SIGIR} }
Metadata:
{ "docs": { "count": 1523878, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 3529 }, "qrels": { "count": 116821, "fields": { "relevance": { "counts_by_value": { "1": 55663, "0": 61158 } } } } }
The most frequent queries in the train set. This represents 20% of the search engine traffic.
Inherits queries from tripclick/train/head
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/train/head/dctr")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/train/head/dctr queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/head/dctr')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from tripclick
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/train/head/dctr")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/train/head/dctr docs
[doc_id] [title] [url] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/head/dctr')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant; never clicked | 93K | 72.2% |
1 | partially relevant; clicked less than 4% of times it was shown (but at least once) | 24K | 18.8% |
2 | relevant; clicked more than 4% but less than 30% of times it was shown | 7.8K | 6.1% |
3 | highly relevant; clicked more than 30% of the times it was shown | 3.7K | 2.9% |
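The thresholds in this table can be read as a simple mapping from a document's click-through rate to a relevance grade. The sketch below is one such reading, not the code used to build the qrels; the function name is hypothetical, and the table does not say which grade applies at exactly 4% or exactly 30%, so assigning those boundary values to the higher grade is an assumption.

```python
def dctr_grade(clicks, impressions):
    """Map click counts for a query-document pair to a 0-3 relevance grade."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    if clicks == 0:
        return 0  # never clicked
    rate = clicks / impressions
    if rate < 0.04:
        return 1  # clicked at least once, but under 4% of the time
    if rate < 0.30:
        return 2  # assumption: exactly 4% is treated as grade 2
    return 3  # assumption: exactly 30% is treated as grade 3
```

For example, a document shown 100 times and clicked 10 times has a rate of 10%, which falls in grade 2.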
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/train/head/dctr")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export tripclick/train/head/dctr qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/head/dctr')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Rekabsaz2021TripClick, title={TripClick: The Log Files of a Large Health Web Search Engine}, author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff}, year={2021}, booktitle={SIGIR} }
Metadata:
{ "docs": { "count": 1523878, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 3529 }, "qrels": { "count": 128420, "fields": { "relevance": { "counts_by_value": { "3": 3684, "2": 7844, "1": 24139, "0": 92753 } } } } }
A version of tripclick/train that replaces the original (noisy) training triples (docpairs) with triples sampled from BM25 results, as suggested by Hofstätter et al. (2022).
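To give a feel for what sampling triples from BM25 results can look like, here is a rough sketch of the general idea: pair a clicked document with a BM25-retrieved document that was not clicked. This is an assumption-laden illustration, not the procedure of Hofstätter et al.; the function name, arguments, and toy data are hypothetical.

```python
import random


def sample_bm25_triples(positives, bm25_results, per_query=1, seed=0):
    """Build (query_id, positive_doc_id, negative_doc_id) triples.

    positives: {query_id: set of clicked doc_ids}
    bm25_results: {query_id: list of doc_ids retrieved by BM25}
    Negatives are drawn from the BM25 list, excluding the positives.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    triples = []
    for query_id, positive_ids in positives.items():
        candidates = [
            doc_id
            for doc_id in bm25_results.get(query_id, [])
            if doc_id not in positive_ids
        ]
        if not positive_ids or not candidates:
            continue  # nothing to pair for this query
        for _ in range(per_query):
            positive = rng.choice(sorted(positive_ids))
            negative = rng.choice(candidates)
            triples.append((query_id, positive, negative))
    return triples


# Invented toy data for illustration only.
toy_positives = {"q1": {"d1"}, "q2": {"d5"}}
toy_bm25 = {"q1": ["d1", "d2"], "q2": ["d5"]}
toy_triples = sample_bm25_triples(toy_positives, toy_bm25)
```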
Inherits queries from tripclick/train
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/train/hofstaetter-triples")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/train/hofstaetter-triples queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/hofstaetter-triples')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from tripclick
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/train/hofstaetter-triples")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/train/hofstaetter-triples docs
[doc_id] [title] [url] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/hofstaetter-triples')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Inherits qrels from tripclick/train
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not clicked and appeared higher in search results | 1.5M | 54.2% |
1 | clicked | 1.2M | 45.8% |
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/train/hofstaetter-triples")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export tripclick/train/hofstaetter-triples qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/hofstaetter-triples')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/train/hofstaetter-triples")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export tripclick/train/hofstaetter-triples docpairs
[query_id] [doc_id_a] [doc_id_b]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Rekabsaz2021TripClick, title={TripClick: The Log Files of a Large Health Web Search Engine}, author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff}, year={2021}, booktitle={SIGIR} }
@inproceedings{Hofstaetter2022TripClick, title={Establishing Strong Baselines for TripClick Health Retrieval}, author={Sebastian Hofst\"atter and Sophia Althammer and Mete Sertkan and Allan Hanbury}, year={2022}, booktitle={ECIR} }
Metadata:
{ "docs": { "count": 1523878, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 685649 }, "qrels": { "count": 2705212, "fields": { "relevance": { "counts_by_value": { "1": 1239161, "0": 1466051 } } } }, "docpairs": { "count": 10000000 } }
The least frequent queries in the train set. This represents 50% of the search engine traffic.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/train/tail")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/train/tail queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/tail')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from tripclick
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/train/tail")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/train/tail docs
[doc_id] [title] [url] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/tail')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not clicked and appeared higher in search results | 863K | 53.2% |
1 | clicked | 759K | 46.8% |
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/train/tail")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export tripclick/train/tail qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/tail')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Rekabsaz2021TripClick, title={TripClick: The Log Files of a Large Health Web Search Engine}, author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff}, year={2021}, booktitle={SIGIR} }
Metadata:
{ "docs": { "count": 1523878, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 576156 }, "qrels": { "count": 1621493, "fields": { "relevance": { "counts_by_value": { "1": 758678, "0": 862815 } } } } }
The moderately frequent queries in the train set. This represents 30% of the search engine traffic.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/train/torso")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/train/torso queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/torso')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from tripclick
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/train/torso")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/train/torso docs
[doc_id] [title] [url] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/torso')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not clicked and appeared higher in search results | 542K | 56.1% |
1 | clicked | 425K | 43.9% |
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/train/torso")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export tripclick/train/torso qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:tripclick/train/torso')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Rekabsaz2021TripClick, title={TripClick: The Log Files of a Large Health Web Search Engine}, author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff}, year={2021}, booktitle={SIGIR} }
Metadata:
{ "docs": { "count": 1523878, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 105964 }, "qrels": { "count": 966898, "fields": { "relevance": { "counts_by_value": { "1": 424820, "0": 542078 } } } } }
Validation subset of tripclick, including all queries from tripclick/val/head, tripclick/val/torso, and tripclick/val/tail.
The scoreddocs are the official BM25 results from Anserini.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/val")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/val queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/val')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from tripclick
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/val")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/val docs
[doc_id] [title] [url] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/val')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not clicked and appeared higher in search results | 42K | 51.4% |
1 | clicked | 40K | 48.6% |
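Since these labels are binary, a common first step is to collect the clicked documents for each query. The sketch below assumes only the qrel fields listed on this page (`query_id`, `doc_id`, `relevance`, `iteration`); the helper `clicked_docs_by_query` and the toy records are hypothetical and stand in for `dataset.qrels_iter()`.

```python
from collections import defaultdict, namedtuple


def clicked_docs_by_query(qrels, min_relevance=1):
    """Map each query_id to the set of doc_ids judged at or above min_relevance."""
    clicked = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            clicked[qrel.query_id].add(qrel.doc_id)
    return dict(clicked)


# Toy records with the same field names as the real qrels (values are made up).
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])
toy_qrels = [
    Qrel("q1", "d1", 1, "0"),
    Qrel("q1", "d2", 0, "0"),
    Qrel("q2", "d3", 1, "0"),
]
clicked = clicked_docs_by_query(toy_qrels)
print(clicked)
```

Queries whose judged documents are all labeled 0 do not appear in the result, so callers may want `clicked.get(query_id, set())`.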
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/val")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export tripclick/val qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:tripclick/val')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/val")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export tripclick/val scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
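Although there is no PyTerrier example for scoreddocs, the records can be regrouped in plain Python, for instance to obtain the top-ranked BM25 candidates per query before re-ranking. This is a sketch under stated assumptions: the helper `top_k_per_query` is hypothetical, and the toy records only mimic the `query_id`, `doc_id`, `score` fields documented above.

```python
from collections import defaultdict, namedtuple


def top_k_per_query(scoreddocs, k=10):
    """Group scored documents by query and keep the k highest-scoring doc_ids."""
    by_query = defaultdict(list)
    for sd in scoreddocs:
        by_query[sd.query_id].append((sd.score, sd.doc_id))
    return {
        query_id: [
            doc_id
            for _, doc_id in sorted(pairs, key=lambda p: p[0], reverse=True)[:k]
        ]
        for query_id, pairs in by_query.items()
    }


# Toy records with the same field names as the real scoreddocs (values are made up).
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])
toy_scoreddocs = [
    ScoredDoc("q1", "d1", 1.5),
    ScoredDoc("q1", "d2", 3.0),
    ScoredDoc("q2", "d3", 0.7),
]
top = top_k_per_query(toy_scoreddocs, k=1)
print(top)
```

With the real data, the iterable would be `dataset.scoreddocs_iter()`; note that this sketch holds every record in memory before sorting.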
Bibtex:
@inproceedings{Rekabsaz2021TripClick, title={TripClick: The Log Files of a Large Health Web Search Engine}, author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff}, year={2021}, booktitle={SIGIR} }
Metadata:
{ "docs": { "count": 1523878, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 3525 }, "qrels": { "count": 82409, "fields": { "relevance": { "counts_by_value": { "1": 40083, "0": 42326 } } } }, "scoreddocs": { "count": 3503310 } }
The most frequent queries in the validation set. This represents 20% of the search engine traffic.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/val/head queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/head')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from tripclick
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/val/head docs
[doc_id] [title] [url] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/head')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not clicked and appeared higher in search results | 32K | 50.2% |
1 | clicked | 32K | 49.8% |
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export tripclick/val/head qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/head')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export tripclick/val/head scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Rekabsaz2021TripClick, title={TripClick: The Log Files of a Large Health Web Search Engine}, author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff}, year={2021}, booktitle={SIGIR} }
Metadata:
{ "docs": { "count": 1523878, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 1175 }, "qrels": { "count": 64364, "fields": { "relevance": { "counts_by_value": { "1": 32067, "0": 32297 } } } }, "scoreddocs": { "count": 1166804 } }
The most frequent queries in the validation set. This represents 20% of the search engine traffic.
Inherits queries from tripclick/val/head
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head/dctr")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/val/head/dctr queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/head/dctr')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from tripclick
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head/dctr")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/val/head/dctr docs
[doc_id] [title] [url] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/head/dctr')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant; never clicked | 47K | 70.3% |
1 | partially relevant; clicked less than 4% of times it was shown (but at least once) | 14K | 20.9% |
2 | relevant; clicked more than 4% but less than 30% of times it was shown | 4.0K | 5.9% |
3 | highly relevant; clicked more than 30% of the times it was shown | 2.0K | 2.9% |
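The graded levels above can be read as thresholds on a document's click-through rate for a query (clicks divided by the number of times it was shown). The following is a minimal sketch of that mapping, not the procedure used to build the dataset. The function name `dctr_label` is hypothetical, and because the table does not say where rates of exactly 4% or 30% fall, the boundary handling here is an assumption.

```python
def dctr_label(clicks, impressions):
    """Map click counts to the 0-3 relevance levels described in the table."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    if clicks == 0:
        return 0  # not relevant; never clicked
    rate = clicks / impressions
    if rate < 0.04:
        return 1  # partially relevant; clicked at least once, under 4%
    if rate < 0.30:
        return 2  # relevant; assumption: exactly 4% is assigned here
    return 3  # highly relevant; assumption: exactly 30% is assigned here


print([dctr_label(c, 100) for c in (0, 1, 10, 50)])
```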
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head/dctr")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export tripclick/val/head/dctr qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/head/dctr')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
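Because the DCTR qrels are graded (0 to 3), nDCG@20 rewards placing a level-3 document above a level-1 document, which the binary click labels cannot express. The sketch below illustrates the idea with a plain-Python nDCG that uses linear gain and a log2 rank discount. It is an illustrative assumption, not PyTerrier's implementation, which may differ in details such as the gain function and tie handling; the function names and toy judgments are hypothetical.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the first k relevance values (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_doc_ids, qrels_for_query, k=20):
    """nDCG@k for one query; unjudged documents count as relevance 0."""
    gains = [qrels_for_query.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(qrels_for_query.values(), reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy judgments for one query (made up): doc_id -> graded relevance.
toy_qrels = {"a": 0, "b": 2, "c": 3}
best = ndcg_at_k(["c", "b", "a"], toy_qrels)
worst = ndcg_at_k(["a", "b", "c"], toy_qrels)
print(best, worst)
```

The ranking that puts the highly relevant document first scores higher than the reversed one, even though both retrieve the same set of documents.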
Inherits scoreddocs from tripclick/val/head
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head/dctr")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export tripclick/val/head/dctr scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Rekabsaz2021TripClick, title={TripClick: The Log Files of a Large Health Web Search Engine}, author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff}, year={2021}, booktitle={SIGIR} }
Metadata:
{ "docs": { "count": 1523878, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 1175 }, "qrels": { "count": 66812, "fields": { "relevance": { "counts_by_value": { "2": 3974, "1": 13936, "0": 46936, "3": 1966 } } } }, "scoreddocs": { "count": 1166804 } }
The least frequent queries in the validation set. This represents 50% of the search engine traffic.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/val/tail")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/val/tail queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/tail')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from tripclick
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/val/tail")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/val/tail docs
[doc_id] [title] [url] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/tail')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not clicked and appeared higher in search results | 2.1K | 53.6% |
1 | clicked | 1.8K | 46.4% |
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/val/tail")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export tripclick/val/tail qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/tail')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/val/tail")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export tripclick/val/tail scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Rekabsaz2021TripClick, title={TripClick: The Log Files of a Large Health Web Search Engine}, author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff}, year={2021}, booktitle={SIGIR} }
Metadata:
{ "docs": { "count": 1523878, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 1175 }, "qrels": { "count": 3912, "fields": { "relevance": { "counts_by_value": { "1": 1814, "0": 2098 } } } }, "scoreddocs": { "count": 1166192 } }
The moderately frequent queries in the validation set. This represents 30% of the search engine traffic.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/val/torso")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/val/torso queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/torso')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from tripclick
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/val/torso")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/val/torso docs
[doc_id] [title] [url] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/torso')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not clicked and appeared higher in search results | 7.9K | 56.1% |
1 | clicked | 6.2K | 43.9% |
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/val/torso")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export tripclick/val/torso qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:tripclick/val/torso')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("tripclick/val/torso")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export tripclick/val/torso scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
Bibtex:
@inproceedings{Rekabsaz2021TripClick, title={TripClick: The Log Files of a Large Health Web Search Engine}, author={Navid Rekabsaz and Oleg Lesota and Markus Schedl and Jon Brassey and Carsten Eickhoff}, year={2021}, booktitle={SIGIR} }
Metadata:
{ "docs": { "count": 1523878, "fields": { "doc_id": { "max_len": 8, "common_prefix": "" } } }, "queries": { "count": 1175 }, "qrels": { "count": 14133, "fields": { "relevance": { "counts_by_value": { "1": 6202, "0": 7931 } } } }, "scoreddocs": { "count": 1170314 } }
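The tripclick/val description states that it includes all queries from the head, torso, and tail subsets. The counts reported in the metadata of each subset are consistent with that, as the short check below shows. The dictionary layout and the helper `sums_match` are hypothetical; only the numbers are taken from this page.

```python
# Counts copied from the metadata of each validation subset on this page.
val_splits = {
    "head": {"queries": 1175, "qrels": 64364, "scoreddocs": 1166804},
    "torso": {"queries": 1175, "qrels": 14133, "scoreddocs": 1170314},
    "tail": {"queries": 1175, "qrels": 3912, "scoreddocs": 1166192},
}
val_all = {"queries": 3525, "qrels": 82409, "scoreddocs": 3503310}


def sums_match(parts, whole):
    """Check that every count in `whole` equals the sum over all `parts`."""
    return all(
        sum(part[key] for part in parts.values()) == whole[key]
        for key in whole
    )


consistent = sums_match(val_splits, val_all)
print(consistent)
```

The tripclick/val/head/dctr subset is left out of the check because it reuses the head queries with a different set of qrels.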