TripClick
To use this dataset, you need a copy of the source files, provided by the Trip Database.
A copy of the source files can be requested through the procedure detailed here. Documents, queries, and qrels require the "TripClick IR Benchmark"; for scoreddocs and docpairs, you will also need to request the "TripClick Training Package for Deep Learning Models".
The source files you will need are:
ir_datasets expects these files to be copied/linked in ~/.ir_datasets/tripclick/.
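Before loading the dataset, it can help to confirm that the files are actually in place. The following is a minimal sketch, not part of ir_datasets: `missing_source_files` is a hypothetical helper, and the file names passed to it are placeholders that should be replaced with the real names from the Trip Database packages.

```python
from pathlib import Path
from typing import Iterable, List


def missing_source_files(directory: Path, expected_names: Iterable[str]) -> List[str]:
    """Return the expected file names that are not present in `directory`.

    Symlinked files count as present, since Path.exists() follows links.
    """
    return [name for name in expected_names if not (directory / name).exists()]


# Default location named in the text above.
tripclick_dir = Path.home() / ".ir_datasets" / "tripclick"

# Placeholder names only (assumption): substitute the real source file names.
placeholder_names = ["benchmark-package.example", "training-package.example"]

print(missing_source_files(tripclick_dir, placeholder_names))
```

An empty list means every expected file was found; anything else lists what still needs to be copied or linked.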
TripClick is a large collection from the Trip Database. Relevance is inferred from click signals.
A copy of this dataset can be obtained from the Trip Database through the process described here. Documents, queries, and qrels require the "TripClick IR Benchmark"; for scoreddocs and docpairs, you will also need to request the "TripClick Training Package for Deep Learning Models".
Language: en
import ir_datasets
dataset = ir_datasets.load("tripclick")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick docs
[doc_id] [title] [url] [text]
You can find more details about the CLI here.
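The exported records can be read back into Python with a few lines of code. This is a rough sketch that assumes the export is tab-separated with one record per line and the four fields in the order shown above; `TripClickDoc` and `parse_docs_tsv` are illustrative names, not part of ir_datasets.

```python
from typing import Iterable, Iterator, NamedTuple


class TripClickDoc(NamedTuple):
    doc_id: str
    title: str
    url: str
    text: str


def parse_docs_tsv(lines: Iterable[str]) -> Iterator[TripClickDoc]:
    """Parse exported lines of the form doc_id<TAB>title<TAB>url<TAB>text."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # maxsplit=3 keeps any stray tabs inside the final text field
        doc_id, title, url, text = line.split("\t", 3)
        yield TripClickDoc(doc_id, title, url, text)


# Made-up sample line for illustration only.
sample = ["d1\tA title\thttp://example.org/d1\tSome body text\n"]
docs = list(parse_docs_tsv(sample))
print(docs[0].doc_id, docs[0].title)
```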
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Test subset of tripclick, including all queries from tripclick/test/head, tripclick/test/torso, and tripclick/test/tail.
The scoreddocs are the official BM25 results from Anserini.
Language: en
import ir_datasets
dataset = ir_datasets.load("tripclick/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/test queries
[query_id] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/test')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from tripclick
import ir_datasets
dataset = ir_datasets.load("tripclick/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/test docs
[doc_id] [title] [url] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/test')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
import ir_datasets
dataset = ir_datasets.load("tripclick/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export tripclick/test scoreddocs --format tsv
[query_id] [doc_id] [score]
You can find more details about the CLI here.
No example available for PyTerrier
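A common use of scoreddocs is to take the top-scoring candidates per query for re-ranking. The sketch below is illustrative only: it uses a stand-in `ScoredDoc` namedtuple with the same three fields as above and made-up records, so it runs without the dataset. With the real data, the iterable from `dataset.scoreddocs_iter()` could be passed in instead.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class ScoredDoc(NamedTuple):
    query_id: str
    doc_id: str
    score: float


def top_k_per_query(scoreddocs: Iterable[ScoredDoc], k: int) -> Dict[str, List[ScoredDoc]]:
    """Group scored documents by query and keep the k highest-scoring per query."""
    by_query = defaultdict(list)
    for sd in scoreddocs:
        by_query[sd.query_id].append(sd)
    return {
        qid: sorted(sds, key=lambda sd: sd.score, reverse=True)[:k]
        for qid, sds in by_query.items()
    }


# Made-up records for illustration only.
sample = [
    ScoredDoc("q1", "d1", 1.5),
    ScoredDoc("q1", "d2", 3.0),
    ScoredDoc("q2", "d3", 0.7),
    ScoredDoc("q1", "d4", 2.0),
]
top = top_k_per_query(sample, k=2)
print([sd.doc_id for sd in top["q1"]])
```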
The most frequent queries in the test set. This represents 20% of the search engine traffic.
Language: en
import ir_datasets
dataset = ir_datasets.load("tripclick/test/head")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/test/head queries
[query_id] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/test/head')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from tripclick
import ir_datasets
dataset = ir_datasets.load("tripclick/test/head")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/test/head docs
[doc_id] [title] [url] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/test/head')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
import ir_datasets
dataset = ir_datasets.load("tripclick/test/head")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export tripclick/test/head scoreddocs --format tsv
[query_id] [doc_id] [score]
You can find more details about the CLI here.
No example available for PyTerrier
The least frequent queries in the test set. This represents 50% of the search engine traffic.
Language: en
import ir_datasets
dataset = ir_datasets.load("tripclick/test/tail")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/test/tail queries
[query_id] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/test/tail')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from tripclick
import ir_datasets
dataset = ir_datasets.load("tripclick/test/tail")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/test/tail docs
[doc_id] [title] [url] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/test/tail')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
import ir_datasets
dataset = ir_datasets.load("tripclick/test/tail")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export tripclick/test/tail scoreddocs --format tsv
[query_id] [doc_id] [score]
You can find more details about the CLI here.
No example available for PyTerrier
The moderately frequent queries in the test set. This represents 30% of the search engine traffic.
Language: en
import ir_datasets
dataset = ir_datasets.load("tripclick/test/torso")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/test/torso queries
[query_id] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/test/torso')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from tripclick
import ir_datasets
dataset = ir_datasets.load("tripclick/test/torso")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/test/torso docs
[doc_id] [title] [url] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/test/torso')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
import ir_datasets
dataset = ir_datasets.load("tripclick/test/torso")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export tripclick/test/torso scoreddocs --format tsv
[query_id] [doc_id] [score]
You can find more details about the CLI here.
No example available for PyTerrier
Training subset of tripclick, including all queries from tripclick/train/head, tripclick/train/torso, and tripclick/train/tail.
The dataset provides docpairs in a full text format; we map this text back to the query and doc IDs. A small number of docpairs could not be mapped back, so they are skipped.
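The mapping step can be pictured as a lookup from text back to identifiers. The sketch below is not the ir_datasets implementation: it assumes exact-match lookup tables from text to ID (the real matching may normalize or match differently), and `map_docpairs_to_ids` with its parameters is a hypothetical name used for illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple


def map_docpairs_to_ids(
    text_triples: Iterable[Tuple[str, str, str]],
    query_ids_by_text: Dict[str, str],
    doc_ids_by_text: Dict[str, str],
) -> Iterator[Tuple[str, str, str]]:
    """Turn (query text, doc A text, doc B text) into (query_id, doc_id_a, doc_id_b).

    Triples with any text that cannot be mapped back to an ID are skipped.
    """
    for query_text, doc_a_text, doc_b_text in text_triples:
        query_id = query_ids_by_text.get(query_text)
        doc_id_a = doc_ids_by_text.get(doc_a_text)
        doc_id_b = doc_ids_by_text.get(doc_b_text)
        if query_id is None or doc_id_a is None or doc_id_b is None:
            continue  # unmappable pair: skip it
        yield query_id, doc_id_a, doc_id_b


# Made-up lookup tables and triples for illustration only.
queries = {"heart failure": "q1"}
docs = {"text A": "d1", "text B": "d2"}
triples = [
    ("heart failure", "text A", "text B"),
    ("heart failure", "text A", "unknown text"),  # cannot be mapped, so skipped
]
mapped = list(map_docpairs_to_ids(triples, queries, docs))
print(mapped)
```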
Language: en
import ir_datasets
dataset = ir_datasets.load("tripclick/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/train queries
[query_id] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/train')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from tripclick
import ir_datasets
dataset = ir_datasets.load("tripclick/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/train docs
[doc_id] [title] [url] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/train')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
0 | not clicked and appeared higher in search results |
1 | clicked |
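To make the two levels concrete, here is one possible reading of the table as code. It is an interpretation, not the procedure used to build the dataset: it assumes that "appeared higher in search results" means ranked above the lowest clicked result for that query, and that results below the last click receive no label. `raw_click_labels` is a hypothetical helper.

```python
from typing import Dict, Sequence, Set


def raw_click_labels(ranking: Sequence[str], clicked: Set[str]) -> Dict[str, int]:
    """Label one result list: 1 for clicked docs, 0 for unclicked docs ranked
    above the lowest click. Docs below the last click are left unlabeled."""
    clicked_positions = [i for i, doc_id in enumerate(ranking) if doc_id in clicked]
    if not clicked_positions:
        return {}  # no click, so nothing can be inferred under this reading
    last_click = max(clicked_positions)
    return {
        doc_id: (1 if doc_id in clicked else 0)
        for doc_id in ranking[: last_click + 1]
    }


# Made-up result list: the user skipped "a" and "b", then clicked "c".
labels = raw_click_labels(["a", "b", "c", "d"], {"c"})
print(labels)
```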
import ir_datasets
dataset = ir_datasets.load("tripclick/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export tripclick/train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
dataset = pt.get_dataset('irds:tripclick/train')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
import ir_datasets
dataset = ir_datasets.load("tripclick/train")
for docpair in dataset.docpairs_iter():
    docpair # namedtuple<query_id, doc_id_a, doc_id_b>
You can find more details about the Python API here.
ir_datasets export tripclick/train docpairs
[query_id] [doc_id_a] [doc_id_b]
You can find more details about the CLI here.
No example available for PyTerrier
The most frequent queries in the train set. This represents 20% of the search engine traffic.
Language: en
import ir_datasets
dataset = ir_datasets.load("tripclick/train/head")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/train/head queries
[query_id] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/train/head')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from tripclick
import ir_datasets
dataset = ir_datasets.load("tripclick/train/head")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/train/head docs
[doc_id] [title] [url] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/train/head')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
0 | not clicked and appeared higher in search results |
1 | clicked |
import ir_datasets
dataset = ir_datasets.load("tripclick/train/head")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export tripclick/train/head qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
dataset = pt.get_dataset('irds:tripclick/train/head')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
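The most frequent queries in the train set, with graded relevance labels derived from how often each document was clicked when shown (see the relevance levels for this subset) rather than from raw clicks. This represents 20% of the search engine traffic.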
Language: en
import ir_datasets
dataset = ir_datasets.load("tripclick/train/head/dctr")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/train/head/dctr queries
[query_id] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/train/head/dctr')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from tripclick
import ir_datasets
dataset = ir_datasets.load("tripclick/train/head/dctr")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/train/head/dctr docs
[doc_id] [title] [url] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/train/head/dctr')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
0 | not relevant; never clicked |
1 | partially relevant; clicked less than 4% of times it was shown (but at least once) |
2 | relevant; clicked more than 4% but less than 30% of times it was shown |
3 | highly relevant; clicked more than 30% of the times it was shown |
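The thresholds in this table translate directly into a small function. This is a sketch under stated assumptions, not code from the dataset authors: the table does not say which level applies at exactly 4% or exactly 30%, so the boundaries here (4% maps to level 1, 30% maps to level 2) are a guess, and `dctr_level` is a hypothetical name.

```python
def dctr_level(clicks: int, impressions: int) -> int:
    """Map a document's click count and impression count to a 0-3 relevance level."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    if clicks == 0:
        return 0  # not relevant: never clicked
    rate = clicks / impressions
    if rate > 0.30:
        return 3  # highly relevant
    if rate > 0.04:
        return 2  # relevant
    return 1  # partially relevant: clicked at least once, at a low rate


# Made-up counts for illustration only.
for clicks, shown in [(0, 100), (1, 100), (10, 100), (40, 100)]:
    print(clicks, shown, dctr_level(clicks, shown))
```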
import ir_datasets
dataset = ir_datasets.load("tripclick/train/head/dctr")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export tripclick/train/head/dctr qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
dataset = pt.get_dataset('irds:tripclick/train/head/dctr')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
The least frequent queries in the train set. This represents 50% of the search engine traffic.
Language: en
import ir_datasets
dataset = ir_datasets.load("tripclick/train/tail")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/train/tail queries
[query_id] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/train/tail')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from tripclick
import ir_datasets
dataset = ir_datasets.load("tripclick/train/tail")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/train/tail docs
[doc_id] [title] [url] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/train/tail')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
0 | not clicked and appeared higher in search results |
1 | clicked |
import ir_datasets
dataset = ir_datasets.load("tripclick/train/tail")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export tripclick/train/tail qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
dataset = pt.get_dataset('irds:tripclick/train/tail')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
The moderately frequent queries in the train set. This represents 30% of the search engine traffic.
Language: en
import ir_datasets
dataset = ir_datasets.load("tripclick/train/torso")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/train/torso queries
[query_id] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/train/torso')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from tripclick
import ir_datasets
dataset = ir_datasets.load("tripclick/train/torso")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/train/torso docs
[doc_id] [title] [url] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/train/torso')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
0 | not clicked and appeared higher in search results |
1 | clicked |
import ir_datasets
dataset = ir_datasets.load("tripclick/train/torso")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export tripclick/train/torso qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
dataset = pt.get_dataset('irds:tripclick/train/torso')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Validation subset of tripclick, including all queries from tripclick/val/head, tripclick/val/torso, and tripclick/val/tail.
The scoreddocs are the official BM25 results from Anserini.
Language: en
import ir_datasets
dataset = ir_datasets.load("tripclick/val")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/val queries
[query_id] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/val')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from tripclick
import ir_datasets
dataset = ir_datasets.load("tripclick/val")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/val docs
[doc_id] [title] [url] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/val')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
0 | not clicked and appeared higher in search results |
1 | clicked |
import ir_datasets
dataset = ir_datasets.load("tripclick/val")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export tripclick/val qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
dataset = pt.get_dataset('irds:tripclick/val')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
import ir_datasets
dataset = ir_datasets.load("tripclick/val")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export tripclick/val scoreddocs --format tsv
[query_id] [doc_id] [score]
You can find more details about the CLI here.
No example available for PyTerrier
The most frequent queries in the validation set. This represents 20% of the search engine traffic.
Language: en
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/val/head queries
[query_id] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/val/head')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from tripclick
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/val/head docs
[doc_id] [title] [url] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/val/head')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
0 | not clicked and appeared higher in search results |
1 | clicked |
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export tripclick/val/head qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
dataset = pt.get_dataset('irds:tripclick/val/head')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export tripclick/val/head scoreddocs --format tsv
[query_id] [doc_id] [score]
You can find more details about the CLI here.
No example available for PyTerrier
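The most frequent queries in the validation set, with graded relevance labels derived from how often each document was clicked when shown (see the relevance levels for this subset) rather than from raw clicks. This represents 20% of the search engine traffic.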
Language: en
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head/dctr")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/val/head/dctr queries
[query_id] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/val/head/dctr')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from tripclick
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head/dctr")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/val/head/dctr docs
[doc_id] [title] [url] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/val/head/dctr')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
0 | not relevant; never clicked |
1 | partially relevant; clicked less than 4% of times it was shown (but at least once) |
2 | relevant; clicked more than 4% but less than 30% of times it was shown |
3 | highly relevant; clicked more than 30% of the times it was shown |
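Graded labels like these are what a measure such as nDCG is designed to use. The sketch below shows one common formulation (linear gain with a log2 rank discount) on made-up grades; evaluation tools can differ in details such as the gain function, so treat it as an illustration rather than a reference implementation. `dcg` and `ndcg_at_k` are hypothetical helper names.

```python
import math
from typing import Dict, Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))


def ndcg_at_k(ranked_doc_ids: Sequence[str], grades: Dict[str, int], k: int) -> float:
    """nDCG@k for one query; documents without a grade count as 0."""
    gains = [grades.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal_gains = sorted(grades.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Made-up grades on the 0-3 scale from the table above.
grades = {"d1": 3, "d2": 1, "d3": 0}
print(ndcg_at_k(["d1", "d2", "d3"], grades, k=3))  # ideal ordering
print(ndcg_at_k(["d2", "d1", "d3"], grades, k=3))  # top two swapped, so lower
```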
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head/dctr")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export tripclick/val/head/dctr qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
dataset = pt.get_dataset('irds:tripclick/val/head/dctr')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head/dctr")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export tripclick/val/head/dctr scoreddocs --format tsv
[query_id] [doc_id] [score]
You can find more details about the CLI here.
No example available for PyTerrier
The least frequent queries in the validation set. This represents 50% of the search engine traffic.
Language: en
import ir_datasets
dataset = ir_datasets.load("tripclick/val/tail")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/val/tail queries
[query_id] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/val/tail')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from tripclick
import ir_datasets
dataset = ir_datasets.load("tripclick/val/tail")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/val/tail docs
[doc_id] [title] [url] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/val/tail')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
0 | not clicked and appeared higher in search results |
1 | clicked |
import ir_datasets
dataset = ir_datasets.load("tripclick/val/tail")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export tripclick/val/tail qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
dataset = pt.get_dataset('irds:tripclick/val/tail')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
import ir_datasets
dataset = ir_datasets.load("tripclick/val/tail")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export tripclick/val/tail scoreddocs --format tsv
[query_id] [doc_id] [score]
You can find more details about the CLI here.
No example available for PyTerrier
The moderately frequent queries in the validation set. This represents 30% of the search engine traffic.
Language: en
import ir_datasets
dataset = ir_datasets.load("tripclick/val/torso")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export tripclick/val/torso queries
[query_id] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/val/torso')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from tripclick
import ir_datasets
dataset = ir_datasets.load("tripclick/val/torso")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
You can find more details about the Python API here.
ir_datasets export tripclick/val/torso docs
[doc_id] [title] [url] [text]
You can find more details about the CLI here.
import pyterrier as pt
dataset = pt.get_dataset('irds:tripclick/val/torso')
# Index tripclick
indexer = pt.IterDictIndexer('./indices/tripclick')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['title', 'url', 'text'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
0 | not clicked and appeared higher in search results |
1 | clicked |
import ir_datasets
dataset = ir_datasets.load("tripclick/val/torso")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export tripclick/val/torso qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
dataset = pt.get_dataset('irds:tripclick/val/torso')
index_ref = pt.IndexRef.of('./indices/tripclick') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
import ir_datasets
dataset = ir_datasets.load("tripclick/val/torso")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export tripclick/val/torso scoreddocs --format tsv
[query_id] [doc_id] [score]
You can find more details about the CLI here.
No example available for PyTerrier