ir_datasets: TREC Tip-of-the-Tongue

import ir_datasets
dataset = ir_datasets.load("trec-tot/2023")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, page_title, wikidata_id, wikidata_classes, text, sections, infoboxes>

You can find more details about the Python API here.

CLI

ir_datasets export trec-tot/2023 docs



[doc_id]    [page_title]    [wikidata_id]    [wikidata_classes]    [text]    [sections]    [infoboxes]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-tot/2023')
# Index trec-tot/2023
indexer = pt.IterDictIndexer('./indices/trec-tot_2023')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['page_title', 'wikidata_id', 'text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-tot.2023')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

Metadata

{
  "docs": {
    "count": 231852,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  }
}

`"trec-tot/2023/dev"`

Dev query set for the TREC 2023 Tip-of-the-Tongue search track.

queries

150 queries

Language: en

Query type:

TipOfTheTongueQuery: (namedtuple)

query_id: str
url: str
domain: str
title: str
text: str
sentence_annotations: List[Dict[str,str]]
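
Each field listed above is available as an attribute on the query tuple. The sketch below mirrors the type with a stand-in `typing.NamedTuple` so that it runs without downloading the dataset. The class `ToTQuery`, the helper `build_text_lookup`, and every sample value are hypothetical and not part of ir_datasets; real records come from `dataset.queries_iter()`.

```python
from typing import Dict, Iterable, List, NamedTuple


class ToTQuery(NamedTuple):
    """Hypothetical stand-in with the same fields as TipOfTheTongueQuery."""
    query_id: str
    url: str
    domain: str
    title: str
    text: str
    sentence_annotations: List[Dict[str, str]]


def build_text_lookup(queries: Iterable[ToTQuery]) -> Dict[str, str]:
    """Map each query_id to its free-text description."""
    return {q.query_id: q.text for q in queries}


# Made-up sample record, used only to exercise the helper.
sample_queries = [
    ToTQuery(
        query_id="q1",
        url="https://example.org/thread/1",
        domain="movie",
        title="Half-remembered film",
        text="A film where a man relives the same day over and over.",
        sentence_annotations=[],
    )
]

lookup = build_text_lookup(sample_queries)
print(lookup["q1"])
```

A lookup like this can be handy when retrieval results only carry the query ID and the description is needed again for inspection.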

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-tot/2023/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, url, domain, title, text, sentence_annotations>

You can find more details about the Python API here.

CLI

ir_datasets export trec-tot/2023/dev queries



[query_id]    [url]    [domain]    [title]    [text]    [sentence_annotations]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-tot/2023/dev')
index_ref = pt.IndexRef.of('./indices/trec-tot_2023') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('url'))

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.trec-tot.2023.dev.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

232K docs

Inherits docs from trec-tot/2023

Language: en

Document type:

TipOfTheTongueDoc: (namedtuple)

doc_id: str
page_title: str
wikidata_id: str
wikidata_classes: List[str]
text: str
sections: Dict[str,str]
infoboxes: List[Dict[str,str]]
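
Because `sections` and `infoboxes` are nested, they usually need flattening before they can be treated as plain text fields. The following is a minimal sketch of one way to do that, assuming simple space-joined concatenation is acceptable. The class `ToTDoc`, the function `flatten_doc`, the output key names, and the sample values are all hypothetical, not part of ir_datasets.

```python
from typing import Dict, List, NamedTuple


class ToTDoc(NamedTuple):
    """Hypothetical stand-in with the same fields as TipOfTheTongueDoc."""
    doc_id: str
    page_title: str
    wikidata_id: str
    wikidata_classes: List[str]
    text: str
    sections: Dict[str, str]
    infoboxes: List[Dict[str, str]]


def flatten_doc(doc: ToTDoc) -> Dict[str, str]:
    """Collapse the nested fields of a document into plain strings."""
    section_text = " ".join(
        f"{heading} {body}" for heading, body in doc.sections.items()
    )
    infobox_text = " ".join(
        f"{key} {value}" for box in doc.infoboxes for key, value in box.items()
    )
    return {
        "docno": doc.doc_id,
        "title": doc.page_title,
        "classes": " ".join(doc.wikidata_classes),
        "sections": section_text,
        "infoboxes": infobox_text,
    }


# Made-up sample record, used only to exercise the helper.
sample_doc = ToTDoc(
    doc_id="12345",
    page_title="Example Film",
    wikidata_id="Q0",
    wikidata_classes=["Q1", "Q2"],
    text="Example Film is a film.",
    sections={"Plot": "A man relives one day."},
    infoboxes=[{"director": "Jane Doe"}],
)

flat = flatten_doc(sample_doc)
print(flat["sections"])
print(flat["infoboxes"])
```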

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-tot/2023/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, page_title, wikidata_id, wikidata_classes, text, sections, infoboxes>

You can find more details about the Python API here.

CLI

ir_datasets export trec-tot/2023/dev docs



[doc_id]    [page_title]    [wikidata_id]    [wikidata_classes]    [text]    [sections]    [infoboxes]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-tot/2023/dev')
# Index trec-tot/2023
indexer = pt.IterDictIndexer('./indices/trec-tot_2023')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['page_title', 'wikidata_id', 'text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-tot.2023.dev')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

150 qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not Relevant	`0`	0.0%
1	Relevant	`150`	100.0%
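
A breakdown like the table above can be recomputed from any iterable of qrels. The sketch below uses a stand-in `Qrel` tuple with the same fields as `TrecQrel` and three made-up judgments; the names `Qrel`, `relevance_breakdown`, and `sample_qrels` are hypothetical. With the real dataset, the input would be `dataset.qrels_iter()`.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Tuple


class Qrel(NamedTuple):
    """Hypothetical stand-in with the same fields as TrecQrel."""
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def relevance_breakdown(qrels: Iterable[Qrel]) -> Dict[int, Tuple[int, float]]:
    """Return {relevance level: (count, percentage of all qrels)}."""
    counts = Counter(q.relevance for q in qrels)
    total = sum(counts.values())
    return {
        level: (n, 100.0 * n / total) for level, n in sorted(counts.items())
    }


# Made-up judgments, used only to exercise the helper.
sample_qrels = [
    Qrel("q1", "d10", 1, "0"),
    Qrel("q2", "d20", 1, "0"),
    Qrel("q3", "d30", 1, "0"),
]

breakdown = relevance_breakdown(sample_qrels)
for level, (count, pct) in breakdown.items():
    print(f"{level}\t{count}\t{pct:.1f}%")
```

Levels that never occur (such as 0 in the table above) do not appear in the result, so a caller that needs a row for every level has to add the zero entries itself.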

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-tot/2023/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export trec-tot/2023/dev qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.
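
The TSV export can be read back with the standard `csv` module. This is a sketch under two assumptions that should be checked against real output: the export has no header row, and the columns appear in the order shown above. The function `read_qrels_tsv` and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO


def read_qrels_tsv(handle: TextIO) -> Dict[str, Dict[str, int]]:
    """Parse rows of query_id, doc_id, relevance, iteration into a nested dict."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or truncated lines
        query_id, doc_id, relevance = row[0], row[1], row[2]
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


# Made-up rows standing in for the output of the export command.
sample_export = "q1\td10\t1\t0\nq2\td20\t1\t0\n"
parsed = read_qrels_tsv(io.StringIO(sample_export))
print(parsed)
```

To read a saved export instead, pass an open file handle, for example `open("qrels.tsv", newline="")`, where the file name is again only an illustration.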

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-tot/2023/dev')
index_ref = pt.IndexRef.of('./indices/trec-tot_2023') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('url'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.
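
This subset lists 150 queries and 150 qrels, all with relevance 1. If each query has exactly one relevant document (an assumption consistent with those counts, though not stated on this page), average precision for a query equals the reciprocal of the rank at which that document is retrieved. The pure-Python sketch below illustrates the equivalence on a made-up ranking; the function names and sample IDs are hypothetical, and this is not how PyTerrier computes its measures internally.

```python
from typing import Dict, List


def reciprocal_rank(ranking: List[str], relevant: Dict[str, int]) -> float:
    """Return 1/rank of the first relevant document, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranking, start=1):
        if relevant.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0


def average_precision(ranking: List[str], relevant: Dict[str, int]) -> float:
    """Return the mean of precision values at each relevant document's rank."""
    num_relevant = sum(1 for rel in relevant.values() if rel > 0)
    if num_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if relevant.get(doc_id, 0) > 0:
            hits += 1
            total += hits / rank
    return total / num_relevant


# Made-up ranking with the single relevant document at rank 3.
ranking = ["d7", "d3", "d10", "d2"]
relevant = {"d10": 1}

rr = reciprocal_rank(ranking, relevant)
ap = average_precision(ranking, relevant)
print(rr, ap)
```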

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.trec-tot.2023.dev.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Metadata

{
  "docs": {
    "count": 231852,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 150
  },
  "qrels": {
    "count": 150,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 150
        }
      }
    }
  }
}
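
The metadata block is plain JSON, so it can be inspected with the standard library. The sketch below pastes the block above into a string and derives the number of qrels per query; the variable names are illustrative only.

```python
import json

# The metadata shown above, copied verbatim.
metadata_json = """
{
  "docs": {
    "count": 231852,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 150
  },
  "qrels": {
    "count": 150,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 150
        }
      }
    }
  }
}
"""

metadata = json.loads(metadata_json)

docs_count = metadata["docs"]["count"]
queries_count = metadata["queries"]["count"]
qrels_count = metadata["qrels"]["count"]

# JSON object keys are strings, so relevance level 1 is stored under "1".
counts_by_value = metadata["qrels"]["fields"]["relevance"]["counts_by_value"]
judged_relevant = counts_by_value.get("1", 0)

qrels_per_query = qrels_count / queries_count
print(docs_count, queries_count, judged_relevant, qrels_per_query)
```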

`"trec-tot/2023/train"`

Train query set for the TREC 2023 Tip-of-the-Tongue search track.

queries

150 queries

Language: en

Query type:

TipOfTheTongueQuery: (namedtuple)

query_id: str
url: str
domain: str
title: str
text: str
sentence_annotations: List[Dict[str,str]]

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-tot/2023/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, url, domain, title, text, sentence_annotations>

You can find more details about the Python API here.

CLI

ir_datasets export trec-tot/2023/train queries



[query_id]    [url]    [domain]    [title]    [text]    [sentence_annotations]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-tot/2023/train')
index_ref = pt.IndexRef.of('./indices/trec-tot_2023') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('url'))

You can find more details about PyTerrier retrieval here.

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.trec-tot.2023.train.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

232K docs

Inherits docs from trec-tot/2023

Language: en

Document type:

TipOfTheTongueDoc: (namedtuple)

doc_id: str
page_title: str
wikidata_id: str
wikidata_classes: List[str]
text: str
sections: Dict[str,str]
infoboxes: List[Dict[str,str]]

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-tot/2023/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, page_title, wikidata_id, wikidata_classes, text, sections, infoboxes>

You can find more details about the Python API here.

CLI

ir_datasets export trec-tot/2023/train docs



[doc_id]    [page_title]    [wikidata_id]    [wikidata_classes]    [text]    [sections]    [infoboxes]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:trec-tot/2023/train')
# Index trec-tot/2023
indexer = pt.IterDictIndexer('./indices/trec-tot_2023')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['page_title', 'wikidata_id', 'text'])

You can find more details about PyTerrier indexing here.

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.trec-tot.2023.train')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

150 qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not Relevant	`0`	0.0%
1	Relevant	`150`	100.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("trec-tot/2023/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export trec-tot/2023/train qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:trec-tot/2023/train')
index_ref = pt.IndexRef.of('./indices/trec-tot_2023') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('url'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.