Github: datasets/tripclick.py

ir_datasets: TripClick

Index
  1. tripclick
  2. tripclick/test
  3. tripclick/test/head
  4. tripclick/test/tail
  5. tripclick/test/torso
  6. tripclick/train
  7. tripclick/train/head
  8. tripclick/train/head/dctr
  9. tripclick/train/tail
  10. tripclick/train/torso
  11. tripclick/val
  12. tripclick/val/head
  13. tripclick/val/head/dctr
  14. tripclick/val/tail
  15. tripclick/val/torso

Data Access Information

To use this dataset, you need a copy of the source files, provided by the Trip Database.

A copy of the source files can be requested through the procedure detailed here. Documents, queries, and qrels require the "TripClick IR Benchmark"; for scoreddocs and docpairs, you will also need to request the "TripClick Training Package for Deep Learning Models".

The source files you will need are:

ir_datasets expects these files to be copied/linked in ~/.ir_datasets/tripclick/.
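The copy/link step can be scripted. The following is a minimal sketch, not part of ir_datasets: `link_source_files` and its parameters are hypothetical names, and because the required file names are not listed here, it simply symlinks every regular file found in a directory you supply. It assumes a platform where symlinks are permitted.

```python
from pathlib import Path


def link_source_files(source_dir, target_dir=None):
    """Symlink every regular file in source_dir into target_dir.

    target_dir defaults to ~/.ir_datasets/tripclick/ (the location named above).
    Returns the list of link paths that were created.
    """
    source = Path(source_dir).expanduser().resolve()
    if target_dir is None:
        target = Path.home() / ".ir_datasets" / "tripclick"
    else:
        target = Path(target_dir).expanduser()
    target.mkdir(parents=True, exist_ok=True)

    created = []
    for src in sorted(source.iterdir()):
        if not src.is_file():
            continue  # skip sub-directories
        link = target / src.name
        if link.exists() or link.is_symlink():
            continue  # leave anything already in place untouched
        link.symlink_to(src)
        created.append(link)
    return created
```

Calling `link_source_files("/path/to/your/download")` with no second argument targets `~/.ir_datasets/tripclick/`; existing files there are never overwritten.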


"tripclick"

TripClick is a large collection from the Trip Database. Relevance is inferred from click signals.

A copy of this dataset can be obtained from the Trip Database through the process described here. Documents, queries, and qrels require the "TripClick IR Benchmark"; for scoreddocs and docpairs, you will also need to request the "TripClick Training Package for Deep Learning Models".
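Once the qrels are available, the click-derived relevance labels can be grouped per query. This is a rough sketch: `group_qrels` and `load_grouped_qrels` are hypothetical helpers, the `Qrel` namedtuple is a stand-in that carries only the fields the sketch reads, and `tripclick/train/head` is used simply because it is one of the subsets that lists qrels.

```python
from collections import defaultdict, namedtuple

# Stand-in with only the fields read below; the real qrel type may carry more.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance"])


def group_qrels(qrels):
    """Return {query_id: {doc_id: relevance}} for any iterable of qrels."""
    grouped = defaultdict(dict)
    for qrel in qrels:
        grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


def load_grouped_qrels(dataset_id="tripclick/train/head"):
    """Group the qrels of a subset; requires ir_datasets and the source files."""
    import ir_datasets  # imported lazily so group_qrels works without it

    return group_qrels(ir_datasets.load(dataset_id).qrels_iter())
```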

docs | Citation

Language: en

Document type:
TitleUrlTextDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. url: str
  4. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, url, text>

You can find more details about the Python API here.
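Because the collection is large, it can help to look at only the first few documents. The sketch below is one way to do that under stated assumptions: `preview_docs` and `preview_tripclick` are hypothetical helpers, and the local `TitleUrlTextDoc` namedtuple only mirrors the fields listed above so the helper can be tried without the source files.

```python
from collections import namedtuple
from itertools import islice

# Local stand-in mirroring the document fields listed above.
TitleUrlTextDoc = namedtuple("TitleUrlTextDoc", ["doc_id", "title", "url", "text"])


def preview_docs(docs, limit=3, width=60):
    """Return (doc_id, title, shortened text) for the first `limit` docs."""
    rows = []
    for doc in islice(docs, limit):
        # Collapse whitespace runs, then truncate to `width` characters.
        snippet = " ".join(doc.text.split())[:width]
        rows.append((doc.doc_id, doc.title, snippet))
    return rows


def preview_tripclick(limit=3):
    """Preview the real collection; requires ir_datasets and the source files."""
    import ir_datasets  # imported lazily so preview_docs works without it

    return preview_docs(ir_datasets.load("tripclick").docs_iter(), limit=limit)
```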


"tripclick/test"

Test subset of tripclick, including all queries from tripclick/test/head, tripclick/test/torso, and tripclick/test/tail.

The scoreddocs are the official BM25 results from Anserini.
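A common use of the scoreddocs is to build an initial ranking per query for re-ranking. The following is a minimal sketch, assuming each scored document exposes `query_id`, `doc_id`, and `score`; `top_k_per_query` and `load_bm25_rankings` are hypothetical helpers, and `ScoredDoc` is a local stand-in used for illustration.

```python
import heapq
from collections import defaultdict, namedtuple

# Local stand-in with the fields read below.
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])


def top_k_per_query(scoreddocs, k=10):
    """Return {query_id: [doc_id, ...]} with the k highest-scoring docs first."""
    by_query = defaultdict(list)
    for sd in scoreddocs:
        by_query[sd.query_id].append((sd.score, sd.doc_id))
    return {
        qid: [doc_id for _, doc_id in heapq.nlargest(k, pairs, key=lambda p: p[0])]
        for qid, pairs in by_query.items()
    }


def load_bm25_rankings(dataset_id="tripclick/test", k=10):
    """Rank the provided BM25 results; requires ir_datasets and the source files."""
    import ir_datasets  # imported lazily so top_k_per_query works without it

    return top_k_per_query(ir_datasets.load(dataset_id).scoreddocs_iter(), k=k)
```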

queries | docs | scoreddocs | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>


"tripclick/test/head"

The most frequent queries in the test set. This represents 20% of the search engine traffic.

queries | docs | scoreddocs | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/test/head")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>


"tripclick/test/tail"

The least frequent queries in the test set. This represents 50% of the search engine traffic.

queries | docs | scoreddocs | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/test/tail")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>


"tripclick/test/torso"

The moderately frequent queries in the test set. This represents 30% of the search engine traffic.

queries | docs | scoreddocs | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/test/torso")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>


"tripclick/train"

Training subset of tripclick, including all queries from tripclick/train/head, tripclick/train/torso, and tripclick/train/tail.

The dataset provides docpairs in a full text format; we map this text back to the query and doc IDs. A small number of docpairs could not be mapped back, so they are skipped.
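The mapping step described above can be illustrated as follows. This is only a sketch of the idea, not the library's implementation: it assumes exact string matching between the full-text triples and the query/document text, and `map_pairs_to_ids`, `Query`, `Doc`, and `DocPair` are names introduced here for illustration.

```python
from collections import namedtuple

# Stand-in types for illustration only.
Query = namedtuple("Query", ["query_id", "text"])
Doc = namedtuple("Doc", ["doc_id", "text"])
DocPair = namedtuple("DocPair", ["query_id", "doc_id_a", "doc_id_b"])


def map_pairs_to_ids(text_triples, queries, docs):
    """Map (query text, doc text A, doc text B) triples back to IDs.

    Returns (mapped_pairs, skipped_count). Triples whose text has no exact
    match among the known queries or docs are counted and skipped.
    """
    query_ids = {q.text: q.query_id for q in queries}
    doc_ids = {d.text: d.doc_id for d in docs}

    mapped, skipped = [], 0
    for query_text, text_a, text_b in text_triples:
        try:
            mapped.append(
                DocPair(query_ids[query_text], doc_ids[text_a], doc_ids[text_b])
            )
        except KeyError:
            skipped += 1  # no exact match for at least one of the three texts
    return mapped, skipped
```

With ir_datasets itself, the already-mapped pairs are read through `dataset.docpairs_iter()`, so this step does not need to be repeated by users.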

queries | docs | qrels | docpairs | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>


"tripclick/train/head"

The most frequent queries in the train set. This represents 20% of the search engine traffic.

queries | docs | qrels | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train/head")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>


"tripclick/train/head/dctr"

The most frequent queries in the train set. This represents 20% of the search engine traffic.

queries | docs | qrels | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train/head/dctr")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>


"tripclick/train/tail"

The least frequent queries in the train set. This represents 50% of the search engine traffic.

queries | docs | qrels | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train/tail")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>


"tripclick/train/torso"

The moderately frequent queries in the train set. This represents 30% of the search engine traffic.

queries | docs | qrels | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/train/torso")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>


"tripclick/val"

Validation subset of tripclick, including all queries from tripclick/val/head, tripclick/val/torso, and tripclick/val/tail.
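The relationship between a combined subset and its head/torso/tail parts can be checked directly. The sketch below is illustrative: `missing_from_composite` and `check_val_split` are hypothetical helpers, and `Query` is a local stand-in with the same two fields as the query type.

```python
from collections import namedtuple

# Local stand-in with the same fields as the query type.
Query = namedtuple("Query", ["query_id", "text"])


def missing_from_composite(composite_queries, *part_queries):
    """Return the query IDs found in any part but absent from the composite."""
    composite_ids = {q.query_id for q in composite_queries}
    part_ids = set()
    for part in part_queries:
        part_ids.update(q.query_id for q in part)
    return part_ids - composite_ids


def check_val_split():
    """Run the check on the real subsets; requires ir_datasets and the source files."""
    import ir_datasets  # imported lazily so the helper above works without it

    def queries(name):
        return ir_datasets.load(name).queries_iter()

    return missing_from_composite(
        queries("tripclick/val"),
        queries("tripclick/val/head"),
        queries("tripclick/val/torso"),
        queries("tripclick/val/tail"),
    )
```

An empty set from `check_val_split()` would indicate that every head, torso, and tail query also appears in the combined subset.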

The scoreddocs are the official BM25 results from Anserini.

queries | docs | qrels | scoreddocs | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>


"tripclick/val/head"

The most frequent queries in the validation set. This represents 20% of the search engine traffic.

queries | docs | qrels | scoreddocs | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>


"tripclick/val/head/dctr"

The most frequent queries in the validation set. This represents 20% of the search engine traffic.

queries | docs | qrels | scoreddocs | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val/head/dctr")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>


"tripclick/val/tail"

The least frequent queries in the validation set. This represents 50% of the search engine traffic.

queries | docs | qrels | scoreddocs | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val/tail")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>


"tripclick/val/torso"

The moderately frequent queries in the validation set. This represents 30% of the search engine traffic.

queries | docs | qrels | scoreddocs | Citation

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("tripclick/val/torso")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
