ir_datasets : Tweets 2013 (Internet Archive)

import ir_datasets
dataset = ir_datasets.load("tweets2013-ia")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, user_id, created_at, lang, reply_doc_id, retweet_doc_id, source, source_content_type>

You can find more details about the Python API here.
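Since the collection holds roughly 253 million tweets, a full pass over `docs_iter()` takes a long time. The sketch below shows one way to inspect only the first few records with `itertools.islice`. The `TweetDoc` stand-in, the `first_n` helper, and the made-up records are assumptions for illustration only, so the snippet runs without downloading the corpus; with the real data you would pass `dataset.docs_iter()` instead of `fake_docs_iter()`.

```python
from collections import namedtuple
from itertools import islice

# Hypothetical stand-in that reuses the TweetDoc field names listed on this page.
TweetDoc = namedtuple(
    "TweetDoc",
    ["doc_id", "text", "user_id", "created_at", "lang",
     "reply_doc_id", "retweet_doc_id", "source", "source_content_type"],
)


def first_n(docs, n):
    """Return the first n items of any document iterator without consuming the rest."""
    return list(islice(docs, n))


def fake_docs_iter():
    """Made-up records standing in for dataset.docs_iter()."""
    for i in range(1000):
        yield TweetDoc(str(i), f"tweet number {i}", "user0", "", "en",
                       None, None, b"{}", "application/json")


sample = first_n(fake_docs_iter(), 3)
for doc in sample:
    print(doc.doc_id, doc.lang, doc.text)
```

Fields are accessed by name (`doc.text`, `doc.lang`), which keeps the code readable given the nine-field tuple.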

CLI

ir_datasets export tweets2013-ia docs



[doc_id]    [text]    [user_id]    [created_at]    [lang]    [reply_doc_id]    [retweet_doc_id]    [source]    [source_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.tweets2013-ia')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

Citation

ir_datasets.bib:

\cite{Sequiera2017TweetsIA}

Bibtex:

@inproceedings{Sequiera2017TweetsIA,
  title={Finally, a Downloadable Test Collection of Tweets},
  author={Royal Sequiera and Jimmy Lin},
  booktitle={SIGIR},
  year={2017}
}

Metadata

{
  "docs": {
    "count": 252713133,
    "fields": {
      "doc_id": {
        "max_len": 18,
        "common_prefix": ""
      }
    }
  }
}

`"tweets2013-ia/trec-mb-2013"`

TREC Microblog 2013 test collection.

queries

60 queries

Language: en

Query type:

TrecMb13Query: (namedtuple)

query_id: str
query: str
time: str
tweet_time: str

Examples:

import ir_datasets
dataset = ir_datasets.load("tweets2013-ia/trec-mb-2013")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, time, tweet_time>

You can find more details about the Python API here.
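Each query carries a `tweet_time` field alongside `time`. The sketch below assumes, without confirmation from this page, that `tweet_time` holds the ID of the most recent tweet at query time and that tweet IDs grow over time, so candidates with a larger numeric `doc_id` were posted after the query. The `visible_doc_ids` helper and all sample values are hypothetical.

```python
from collections import namedtuple

# Stand-in with the TrecMb13Query field names listed on this page.
TrecMb13Query = namedtuple("TrecMb13Query", ["query_id", "query", "time", "tweet_time"])


def visible_doc_ids(doc_ids, query):
    """Keep doc_ids whose numeric value does not exceed the query's tweet_time.

    Assumption: tweet_time is a tweet ID acting as a cutoff for the query.
    """
    cutoff = int(query.tweet_time)
    return [doc_id for doc_id in doc_ids if int(doc_id) <= cutoff]


# Made-up query and candidate IDs, not taken from the collection.
query = TrecMb13Query("q1", "example query", "", "300")
candidates = ["100", "299", "300", "301", "9999"]
kept = visible_doc_ids(candidates, query)
print(kept)
```

The IDs are compared as integers because string comparison would order `"9999"` after `"300"` only by accident of length.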

CLI

ir_datasets export tweets2013-ia/trec-mb-2013 queries



[query_id]    [query]    [time]    [tweet_time]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.tweets2013-ia.trec-mb-2013.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

253M docs

Inherits docs from tweets2013-ia

Language: multiple/other/unknown

Document type:

TweetDoc: (namedtuple)

doc_id: str
text: str
user_id: str
created_at: str
lang: str
reply_doc_id: str
retweet_doc_id: str
source: bytes
source_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("tweets2013-ia/trec-mb-2013")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, user_id, created_at, lang, reply_doc_id, retweet_doc_id, source, source_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export tweets2013-ia/trec-mb-2013 docs



[doc_id]    [text]    [user_id]    [created_at]    [lang]    [reply_doc_id]    [retweet_doc_id]    [source]    [source_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.tweets2013-ia.trec-mb-2013')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

71K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	not relevant	`62K`	87.4%
1	relevant	`5.9K`	8.2%
2	highly relevant	`3.2K`	4.4%

Examples:

import ir_datasets
dataset = ir_datasets.load("tweets2013-ia/trec-mb-2013")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
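Evaluation code usually needs judgments grouped by query rather than as a flat stream. The following is a minimal sketch of that grouping, plus a threshold for treating only level 2 ("highly relevant") as relevant. The `TrecQrel` stand-in, the helper names, and the sample judgments are assumptions for illustration; with the real data you would pass `dataset.qrels_iter()` to `group_qrels`.

```python
from collections import defaultdict, namedtuple

# Stand-in with the TrecQrel field names listed on this page.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def group_qrels(qrels):
    """Build {query_id: {doc_id: relevance}} from an iterable of qrels."""
    by_query = defaultdict(dict)
    for qrel in qrels:
        by_query[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(by_query)


def relevant_ids(judged, min_rel=1):
    """Return the doc_ids judged at or above min_rel for one query."""
    return {doc_id for doc_id, rel in judged.items() if rel >= min_rel}


# Made-up judgments, not taken from the collection.
sample_qrels = [
    TrecQrel("q1", "d1", 0, "0"),
    TrecQrel("q1", "d2", 1, "0"),
    TrecQrel("q1", "d3", 2, "0"),
    TrecQrel("q2", "d1", 2, "0"),
]
grouped = group_qrels(sample_qrels)
print(sorted(relevant_ids(grouped["q1"])))             # levels 1 and 2
print(sorted(relevant_ids(grouped["q1"], min_rel=2)))  # level 2 only
```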

CLI

ir_datasets export tweets2013-ia/trec-mb-2013 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.tweets2013-ia.trec-mb-2013.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Lin2013Microblog,Sequiera2017TweetsIA}

Bibtex:

@inproceedings{Lin2013Microblog,
  title={Overview of the TREC-2013 Microblog Track},
  author={Jimmy Lin and Miles Efron},
  booktitle={TREC},
  year={2013}
}
@inproceedings{Sequiera2017TweetsIA,
  title={Finally, a Downloadable Test Collection of Tweets},
  author={Royal Sequiera and Jimmy Lin},
  booktitle={SIGIR},
  year={2017}
}

Metadata

{
  "docs": {
    "count": 252713133,
    "fields": {
      "doc_id": {
        "max_len": 18,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 60
  },
  "qrels": {
    "count": 71279,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 62268,
          "1": 5856,
          "2": 3155
        }
      }
    }
  }
}
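The percentages in the relevance-level table can be re-derived from the metadata above. The sketch below copies the `qrels` counts from this page into a dictionary, checks that they sum to the reported total, and computes each level's share. The variable names are illustrative only.

```python
# Counts copied from the "qrels" section of the metadata on this page.
metadata = {
    "qrels": {
        "count": 71279,
        "fields": {
            "relevance": {
                "counts_by_value": {"0": 62268, "1": 5856, "2": 3155}
            }
        },
    }
}

counts = metadata["qrels"]["fields"]["relevance"]["counts_by_value"]
total = metadata["qrels"]["count"]

# The per-level counts should account for every judgment.
counts_sum = sum(counts.values())

# Share of each relevance level, formatted to one decimal place.
shares = {level: f"{100 * n / total:.1f}%" for level, n in counts.items()}
print(counts_sum == total)
print(shares)  # {'0': '87.4%', '1': '8.2%', '2': '4.4%'}
```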

`"tweets2013-ia/trec-mb-2014"`

TREC Microblog 2014 test collection.

queries

55 queries

Language: en

Query type:

TrecMb14Query: (namedtuple)

query_id: str
query: str
time: str
tweet_time: str
description: str

Examples:

import ir_datasets
dataset = ir_datasets.load("tweets2013-ia/trec-mb-2014")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, time, tweet_time, description>

You can find more details about the Python API here.

CLI

ir_datasets export tweets2013-ia/trec-mb-2014 queries



[query_id]    [query]    [time]    [tweet_time]    [description]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.tweets2013-ia.trec-mb-2014.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

253M docs

Inherits docs from tweets2013-ia

Language: multiple/other/unknown

Document type:

TweetDoc: (namedtuple)

doc_id: str
text: str
user_id: str
created_at: str
lang: str
reply_doc_id: str
retweet_doc_id: str
source: bytes
source_content_type: str

Examples:

import ir_datasets
dataset = ir_datasets.load("tweets2013-ia/trec-mb-2014")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, user_id, created_at, lang, reply_doc_id, retweet_doc_id, source, source_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export tweets2013-ia/trec-mb-2014 docs



[doc_id]    [text]    [user_id]    [created_at]    [lang]    [reply_doc_id]    [retweet_doc_id]    [source]    [source_content_type]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.tweets2013-ia.trec-mb-2014')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

58K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	not relevant	`47K`	81.6%
1	relevant	`4.8K`	8.2%
2	highly relevant	`5.9K`	10.2%

Examples:

import ir_datasets
dataset = ir_datasets.load("tweets2013-ia/trec-mb-2014")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export tweets2013-ia/trec-mb-2014 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.
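Output from the export command can be read back into Python with the standard `csv` module. This is a sketch under the assumption that each line holds the four tab-separated fields in the order shown above, with no header row; the `parse_qrels_tsv` helper and the sample text are hypothetical. With a real export you would pass an open file handle instead of the `io.StringIO` object.

```python
import csv
import io

# Column order as shown in the CLI output layout on this page.
FIELDS = ["query_id", "doc_id", "relevance", "iteration"]


def parse_qrels_tsv(stream):
    """Parse tab-separated qrels lines into dicts, converting relevance to int."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        record = dict(zip(FIELDS, row))
        record["relevance"] = int(record["relevance"])
        rows.append(record)
    return rows


# Made-up lines in the assumed layout, not taken from the collection.
sample_text = "q1\td1\t0\t0\nq1\td2\t2\t0\n"
rows = parse_qrels_tsv(io.StringIO(sample_text))
for row in rows:
    print(row["query_id"], row["doc_id"], row["relevance"])
```

`csv.QUOTE_NONE` is used so that any quote characters in a field are kept literally rather than interpreted as CSV quoting.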

No example available for PyTerrier