GitHub: datasets/tripclick.py

ir_datasets: TripClick

Index
  1. tripclick
  2. tripclick/logs
  3. tripclick/test
  4. tripclick/test/head
  5. tripclick/test/tail
  6. tripclick/test/torso
  7. tripclick/train
  8. tripclick/train/head
  9. tripclick/train/head/dctr
  10. tripclick/train/hofstaetter-triples
  11. tripclick/train/tail
  12. tripclick/train/torso
  13. tripclick/val
  14. tripclick/val/head
  15. tripclick/val/head/dctr
  16. tripclick/val/tail
  17. tripclick/val/torso

Data Access Information

To use this dataset, you need a copy of the source files, provided by the Trip Database.

A copy of the source files can be requested through the procedure detailed here. Documents, queries, and qrels require the "TripClick IR Benchmark"; for scoreddocs and docpairs, you will also need to request the "TripClick Training Package for Deep Learning Models". If you want the raw query logs, you will need to request the "Logs Dataset".

The source files you will need are:

ir_datasets expects these files to be copied/linked in ~/.ir_datasets/tripclick/.
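
Before loading the dataset, it can help to confirm that this directory exists and to see what it contains. The following is a minimal sketch using only the standard library. The helper name `list_source_files` is hypothetical (it is not part of ir_datasets), and it makes no assumption about which file names are required.

```python
from pathlib import Path

# Default location named above; pass another path if your setup differs.
DEFAULT_DIR = Path.home() / ".ir_datasets" / "tripclick"


def list_source_files(base=DEFAULT_DIR):
    """Return the sorted names of entries in `base`, or [] if it is not a directory."""
    base = Path(base)
    if not base.is_dir():
        return []
    # Entries may be regular files or links; both are listed by name.
    return sorted(p.name for p in base.iterdir())
```

Calling `list_source_files()` with no arguments inspects the default location; an empty list suggests the files have not been copied or linked yet.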


"tripclick"

TripClick is a large collection from the Trip Database. Relevance is inferred from click signals.

A copy of this dataset can be obtained from the Trip Database through the process described here. Documents, queries, and qrels require the "TripClick IR Benchmark"; for scoreddocs and docpairs, you will also need to request the "TripClick Training Package for Deep Learning Models".

Provides: docs
1.5M docs

Language: en

Document type:
TitleUrlTextDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. url: str
  4. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("tripclick")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>
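
With 1.5M documents, a full pass can be slow, so it is often enough to look at the first few records. The following is a minimal sketch: `peek` is a hypothetical helper (not part of ir_datasets) that works on any iterator, and the stand-in documents exist only so the snippet runs without the source files.

```python
from collections import namedtuple
from itertools import islice


def peek(records, n=3):
    """Return the first n records of an iterator as a list."""
    return list(islice(records, n))


# Stand-in with the same fields as TitleUrlTextDoc, for illustration only.
Doc = namedtuple("Doc", ["doc_id", "title", "url", "text"])
fake_docs = (Doc(str(i), f"title {i}", f"http://example.org/{i}", "...") for i in range(10))

first = peek(fake_docs, 2)
print([d.doc_id for d in first])  # ['0', '1']
```

With the real data, `peek(dataset.docs_iter())` would take the place of `peek(fake_docs, 2)`.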

You can find more details about the Python API here.


"tripclick/logs"

Raw query logs from TripClick.

This subset includes a broader set of documents than the main collection, but each document provides only a title and a URL.

Provides: docs, qlogs
5.2M docs

Language: en

Document type:
TripClickPartialDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. url: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("tripclick/logs")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url>
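
Because these documents have no `text` field, code that handles both the main collection and the logs needs to tolerate its absence. The following is a minimal sketch that assumes only the field lists shown on this page. `doc_to_dict` is a hypothetical helper, and the namedtuples are local stand-ins for the ir_datasets types.

```python
from collections import namedtuple

# Local stand-ins mirroring the field lists of the two document types.
TitleUrlTextDoc = namedtuple("TitleUrlTextDoc", ["doc_id", "title", "url", "text"])
TripClickPartialDoc = namedtuple("TripClickPartialDoc", ["doc_id", "title", "url"])


def doc_to_dict(doc):
    """Convert either document type to a dict; `text` is None when absent."""
    record = dict(doc._asdict())
    record.setdefault("text", None)
    return record


print(doc_to_dict(TripClickPartialDoc("42", "A title", "http://example.org/42")))
# {'doc_id': '42', 'title': 'A title', 'url': 'http://example.org/42', 'text': None}
```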



"tripclick/test"

Test subset of tripclick, including all queries from tripclick/test/head, tripclick/test/torso, and tripclick/test/tail.

The scoreddocs are the official BM25 results from Anserini.

Provides: queries, docs, scoreddocs
3.5K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("tripclick/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
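
The scoreddocs can be read with `dataset.scoreddocs_iter()`, which yields records with `query_id`, `doc_id`, and `score` fields. The following is a minimal sketch that groups such records into a per-query ranking. `top_k_per_query` is a hypothetical helper, and the sample records are stand-ins so the snippet runs without the source files.

```python
from collections import defaultdict, namedtuple


def top_k_per_query(scoreddocs, k=10):
    """Group scored records by query and keep the k highest-scoring docs for each."""
    by_query = defaultdict(list)
    for sd in scoreddocs:
        by_query[sd.query_id].append((sd.doc_id, sd.score))
    return {
        qid: sorted(pairs, key=lambda p: p[1], reverse=True)[:k]
        for qid, pairs in by_query.items()
    }


# Stand-in records with the fields used for scoreddocs, for illustration only.
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])
sample = [ScoredDoc("q1", "d1", 2.0), ScoredDoc("q1", "d2", 5.0), ScoredDoc("q2", "d3", 1.0)]
print(top_k_per_query(sample, k=1))  # {'q1': [('d2', 5.0)], 'q2': [('d3', 1.0)]}
```

With the real data, `top_k_per_query(dataset.scoreddocs_iter())` would produce the BM25 ranking for each test query.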



"tripclick/test/head"

The most frequent queries in the test set. This represents 20% of the search engine traffic.

Provides: queries, docs, scoreddocs
1.2K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("tripclick/test/head")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"tripclick/test/tail"

The least frequent queries in the test set. This represents 50% of the search engine traffic.

Provides: queries, docs, scoreddocs
1.2K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("tripclick/test/tail")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"tripclick/test/torso"

The moderately frequent queries in the test set. This represents 30% of the search engine traffic.

Provides: queries, docs, scoreddocs
1.2K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("tripclick/test/torso")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"tripclick/train"

Training subset of tripclick, including all queries from tripclick/train/head, tripclick/train/torso, and tripclick/train/tail.

The dataset provides docpairs in a full text format; we map this text back to the query and doc IDs. A small number of docpairs could not be mapped back, so they are skipped.

Provides: queries, docs, qrels, docpairs
686K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("tripclick/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
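
Since this subset is described as the union of the head, torso, and tail queries, one sanity check is to look for query IDs that none of the three subsets cover. The following is a minimal sketch: `uncovered_query_ids` is a hypothetical helper, and the sample queries are stand-ins so the snippet runs without the source files.

```python
from collections import namedtuple


def uncovered_query_ids(parent_queries, *subset_queries):
    """Return IDs in the parent split that appear in none of the subsets."""
    covered = set()
    for subset in subset_queries:
        covered.update(q.query_id for q in subset)
    return {q.query_id for q in parent_queries} - covered


# Stand-in with the same fields as GenericQuery, for illustration only.
Query = namedtuple("Query", ["query_id", "text"])
head = [Query("1", "a")]
torso = [Query("2", "b")]
tail = [Query("3", "c")]
parent = head + torso + tail + [Query("4", "d")]
print(uncovered_query_ids(parent, head, torso, tail))  # {'4'}

# With the source files in place, the inputs would come from ir_datasets, e.g.:
#   import ir_datasets
#   parent = ir_datasets.load("tripclick/train").queries_iter()
#   parts = [ir_datasets.load(f"tripclick/train/{p}").queries_iter()
#            for p in ("head", "torso", "tail")]
#   uncovered_query_ids(parent, *parts)
```

If the description above holds for the real data, the result should be an empty set.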



"tripclick/train/head"

The most frequent queries in the train set. This represents 20% of the search engine traffic.

Provides: queries, docs, qrels
3.5K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("tripclick/train/head")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"tripclick/train/head/dctr"

The most frequent queries in the train set. This represents 20% of the search engine traffic.

Provides: queries, docs, qrels
3.5K queries

Inherits queries from tripclick/train/head

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("tripclick/train/head/dctr")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"tripclick/train/hofstaetter-triples"

A version of tripclick/train that replaces the original (noisy) training triples (docpairs) with triples sampled from BM25 results, as suggested by Hofstätter et al. (2022).

Provides: queries, docs, qrels, docpairs
686K queries

Inherits queries from tripclick/train

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("tripclick/train/hofstaetter-triples")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
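
The training triples are read with `dataset.docpairs_iter()`, which yields records with `query_id`, `doc_id_a`, and `doc_id_b` fields. The following is a minimal sketch that turns those IDs back into text. `resolve_triples` is a hypothetical helper; it assumes, as is conventional for training triples, that `doc_id_a` is the preferred document. The sample data are stand-ins so the snippet runs without the source files.

```python
from collections import namedtuple


def resolve_triples(docpairs, query_text, doc_text):
    """Yield (query text, text of doc_id_a, text of doc_id_b) for each docpair.

    `query_text` and `doc_text` are callables mapping an ID to its text.
    """
    for pair in docpairs:
        yield query_text(pair.query_id), doc_text(pair.doc_id_a), doc_text(pair.doc_id_b)


# Stand-in records and lookups, for illustration only.
DocPair = namedtuple("DocPair", ["query_id", "doc_id_a", "doc_id_b"])
queries = {"q1": "knee pain"}
docs = {"d1": "preferred text", "d2": "other text"}

triples = list(resolve_triples([DocPair("q1", "d1", "d2")], queries.get, docs.get))
print(triples)  # [('knee pain', 'preferred text', 'other text')]
```

With the real data, one option is to build the query lookup from `dataset.queries_iter()` and to create `store = dataset.docs_store()` once, passing `lambda doc_id: store.get(doc_id).text` as `doc_text`.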



"tripclick/train/tail"

The least frequent queries in the train set. This represents 50% of the search engine traffic.

Provides: queries, docs, qrels
576K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("tripclick/train/tail")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"tripclick/train/torso"

The moderately frequent queries in the train set. This represents 30% of the search engine traffic.

Provides: queries, docs, qrels
106K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("tripclick/train/torso")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"tripclick/val"

Validation subset of tripclick, including all queries from tripclick/val/head, tripclick/val/torso, and tripclick/val/tail.

The scoreddocs are the official BM25 results from Anserini.

Provides: queries, docs, qrels, scoreddocs
3.5K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("tripclick/val")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
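
Because this subset provides both qrels and scoreddocs, the BM25 rankings can be checked against the click-based judgments. The following is a minimal sketch, assuming qrel records expose `query_id`, `doc_id`, and `relevance` fields as yielded by `dataset.qrels_iter()`. Both helper functions are hypothetical, and the sample records are stand-ins so the snippet runs without the source files.

```python
from collections import defaultdict, namedtuple


def qrels_to_dict(qrels):
    """Build {query_id: {doc_id: relevance}} from qrel records."""
    out = defaultdict(dict)
    for qrel in qrels:
        out[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(out)


def precision_at_k(ranked_doc_ids, judged, k=10):
    """Fraction of the top-k positions holding a doc with relevance > 0.

    Unjudged documents count as non-relevant; the denominator is always k.
    """
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if judged.get(doc_id, 0) > 0)
    return hits / k


# Stand-in records, for illustration only.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])
sample = [Qrel("q1", "d1", 1, "0"), Qrel("q1", "d2", 0, "0")]
judged = qrels_to_dict(sample)["q1"]
print(precision_at_k(["d1", "d2", "d3"], judged, k=2))  # 0.5
```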



"tripclick/val/head"

The most frequent queries in the validation set. This represents 20% of the search engine traffic.

Provides: queries, docs, qrels, scoreddocs
1.2K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"tripclick/val/head/dctr"

The most frequent queries in the validation set. This represents 20% of the search engine traffic.

Provides: queries, docs, qrels, scoreddocs
1.2K queries

Inherits queries from tripclick/val/head

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head/dctr")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"tripclick/val/tail"

The least frequent queries in the validation set. This represents 50% of the search engine traffic.

Provides: queries, docs, qrels, scoreddocs
1.2K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("tripclick/val/tail")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"tripclick/val/torso"

The moderately frequent queries in the validation set. This represents 30% of the search engine traffic.

Provides: queries, docs, qrels, scoreddocs
1.2K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("tripclick/val/torso")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
