Github: datasets/tweets2013_ia.py

`ir_datasets`: Tweets 2013 (Internet Archive)

Index

tweets2013-ia
tweets2013-ia/trec-mb-2013
tweets2013-ia/trec-mb-2014

`"tweets2013-ia"`

A collection of tweets from a 2-month window archived by the Internet Archive. This collection can serve as a stand-in document collection for the TREC Microblog 2013-14 tasks. (Although it is not exactly the same collection, Sequiera and Lin show that it is close enough.)

This collection is downloaded automatically from the Internet Archive, but download speeds are often slow, so expect the download to take some time. During the download, ir_datasets builds a new directory hierarchy to support fast lookups and slices.
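A minimal sketch of what those lookups and slices can look like from Python, assuming the general ir_datasets `docs_store()` and sliceable `docs_iter()` interfaces apply to this collection. The helper names (`lookup_tweet`, `sample_slice`, `main`) are illustrative and not part of ir_datasets. `main()` would trigger the full download, so it is defined but not called.

```python
def lookup_tweet(dataset, doc_id):
    """Fetch a single tweet by its doc_id through the dataset's document store."""
    return dataset.docs_store().get(doc_id)


def sample_slice(dataset, start, stop):
    """Return the documents at positions [start, stop) of the collection as a list."""
    return list(dataset.docs_iter()[start:stop])


def main():
    # Imported lazily so the helpers above can be reused without ir_datasets installed.
    import ir_datasets

    dataset = ir_datasets.load("tweets2013-ia")
    # Print the first three tweets of the collection.
    for doc in sample_slice(dataset, 0, 3):
        print(doc.doc_id, doc.text)
```

Calling `main()` on a machine where the collection has already been downloaded avoids the slow initial transfer.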

docs

253M docs

Language: multiple/other/unknown

Document type:

TweetDoc: (namedtuple)

doc_id: str
text: str
user_id: str
created_at: str
lang: str
reply_doc_id: str
retweet_doc_id: str
source: bytes
source_content_type: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("tweets2013-ia")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, user_id, created_at, lang, reply_doc_id, retweet_doc_id, source, source_content_type>

You can find more details about the Python API here.
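As an illustration of how the `TweetDoc` fields might be used, the sketch below keeps only tweets in one language that are not retweets. It rests on two assumptions that are not stated above: that `lang` holds a short language code such as `"en"`, and that `retweet_doc_id` is `None` or empty for tweets that are not retweets. `original_tweets` and `FakeDoc` are hypothetical names, and the sample records are synthetic.

```python
from collections import namedtuple


def original_tweets(docs, lang="en"):
    """Yield docs whose language tag equals `lang` and that are not retweets."""
    for doc in docs:
        # Assumption: retweet_doc_id is None or empty when the tweet is not a retweet.
        if doc.lang == lang and not doc.retweet_doc_id:
            yield doc


# Synthetic stand-ins carrying a subset of TweetDoc's fields (not real data).
FakeDoc = namedtuple("FakeDoc", ["doc_id", "text", "lang", "retweet_doc_id"])
sample = [
    FakeDoc("1", "hello", "en", None),
    FakeDoc("2", "RT hello", "en", "1"),
    FakeDoc("3", "hola", "es", None),
]

kept = [doc.doc_id for doc in original_tweets(sample)]
print(kept)  # ['1']
```

With the real collection, `dataset.docs_iter()` could be passed in place of `sample`, since the filter only reads the `lang` and `retweet_doc_id` attributes.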

CLI

ir_datasets export tweets2013-ia docs



[doc_id]    [text]    [user_id]    [created_at]    [lang]    [reply_doc_id]    [retweet_doc_id]    [source]    [source_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Sequiera2017TweetsIA}

Bibtex:

@inproceedings{Sequiera2017TweetsIA, title={Finally, a Downloadable Test Collection of Tweets}, author={Royal Sequiera and Jimmy Lin}, booktitle={SIGIR}, year={2017} }

Metadata

{
  "docs": {
    "count": 252713133,
    "fields": {
      "doc_id": {
        "max_len": 18,
        "common_prefix": ""
      }
    }
  }
}

`"tweets2013-ia/trec-mb-2013"`

TREC Microblog 2013 test collection.

queries

60 queries

Language: en

Query type:

TrecMb13Query: (namedtuple)

query_id: str
query: str
time: str
tweet_time: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("tweets2013-ia/trec-mb-2013")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, time, tweet_time>

You can find more details about the Python API here.
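The `tweet_time` field can be used to restrict results to tweets that existed when the query was issued. The sketch below assumes that `tweet_time` holds the ID of the most recent tweet at query time and that tweet IDs increase over time; neither assumption is stated on this page, so verify both before relying on them. `visible_doc_ids` and `FakeQuery` are hypothetical names, and the values are synthetic.

```python
from collections import namedtuple


def visible_doc_ids(query, doc_ids):
    """Keep only the doc IDs that are not newer than the query's tweet_time."""
    # Assumption: tweet_time and doc_id are both numeric tweet IDs stored as strings.
    cutoff = int(query.tweet_time)
    return [doc_id for doc_id in doc_ids if int(doc_id) <= cutoff]


# Synthetic query with the same field names as TrecMb13Query (not real data).
FakeQuery = namedtuple("FakeQuery", ["query_id", "query", "time", "tweet_time"])
query = FakeQuery("MB0", "example topic", "placeholder time", "1000")

print(visible_doc_ids(query, ["999", "1000", "1001"]))  # ['999', '1000']
```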

CLI

ir_datasets export tweets2013-ia/trec-mb-2013 queries



[query_id]    [query]    [time]    [tweet_time]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs

253M docs

Inherits docs from tweets2013-ia

Language: multiple/other/unknown

Document type:

TweetDoc: (namedtuple)

doc_id: str
text: str
user_id: str
created_at: str
lang: str
reply_doc_id: str
retweet_doc_id: str
source: bytes
source_content_type: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("tweets2013-ia/trec-mb-2013")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, user_id, created_at, lang, reply_doc_id, retweet_doc_id, source, source_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export tweets2013-ia/trec-mb-2013 docs



[doc_id]    [text]    [user_id]    [created_at]    [lang]    [reply_doc_id]    [retweet_doc_id]    [source]    [source_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels

71K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	not relevant	`62K`	87.4%
1	relevant	`5.9K`	8.2%
2	highly relevant	`3.2K`	4.4%
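The percentages in the table follow directly from the per-level counts given in this subset's metadata. A short sketch of that arithmetic, where `relevance_shares` is an illustrative helper name:

```python
# Per-level judgment counts for tweets2013-ia/trec-mb-2013, taken from its metadata.
counts_by_value = {"0": 62268, "1": 5856, "2": 3155}


def relevance_shares(counts):
    """Return each relevance level's share of all judgments, in percent (1 decimal)."""
    total = sum(counts.values())
    return {level: round(100 * n / total, 1) for level, n in counts.items()}


total = sum(counts_by_value.values())
shares = relevance_shares(counts_by_value)
print(total)   # 71279
print(shares)  # {'0': 87.4, '1': 8.2, '2': 4.4}
```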

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("tweets2013-ia/trec-mb-2013")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
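Evaluation code usually needs the judgments grouped by query rather than as a flat stream. A minimal sketch of that grouping, written against any iterable of objects with `query_id`, `doc_id`, and `relevance` attributes; `group_qrels` and `FakeQrel` are hypothetical names, and the sample judgments are synthetic.

```python
from collections import defaultdict, namedtuple


def group_qrels(qrels):
    """Build {query_id: {doc_id: relevance}} from a flat iterable of qrels."""
    grouped = defaultdict(dict)
    for qrel in qrels:
        grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


# Synthetic judgments with the same field names as TrecQrel (not real data).
FakeQrel = namedtuple("FakeQrel", ["query_id", "doc_id", "relevance", "iteration"])
sample_qrels = [
    FakeQrel("q1", "10", 2, "0"),
    FakeQrel("q1", "11", 0, "0"),
    FakeQrel("q2", "10", 1, "0"),
]

by_query = group_qrels(sample_qrels)
print(by_query)  # {'q1': {'10': 2, '11': 0}, 'q2': {'10': 1}}
```

With the real data, `dataset.qrels_iter()` could be passed in place of `sample_qrels`.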

CLI

ir_datasets export tweets2013-ia/trec-mb-2013 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Lin2013Microblog,Sequiera2017TweetsIA}

Bibtex:

@inproceedings{Lin2013Microblog, title={Overview of the TREC-2013 Microblog Track}, author={Jimmy Lin and Miles Efron}, booktitle={TREC}, year={2013} } @inproceedings{Sequiera2017TweetsIA, title={Finally, a Downloadable Test Collection of Tweets}, author={Royal Sequiera and Jimmy Lin}, booktitle={SIGIR}, year={2017} }

Metadata

{
  "docs": {
    "count": 252713133,
    "fields": {
      "doc_id": {
        "max_len": 18,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 60
  },
  "qrels": {
    "count": 71279,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 62268,
          "1": 5856,
          "2": 3155
        }
      }
    }
  }
}

`"tweets2013-ia/trec-mb-2014"`

TREC Microblog 2014 test collection.

queries

55 queries

Language: en

Query type:

TrecMb14Query: (namedtuple)

query_id: str
query: str
time: str
tweet_time: str
description: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("tweets2013-ia/trec-mb-2014")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, time, tweet_time, description>

You can find more details about the Python API here.

CLI

ir_datasets export tweets2013-ia/trec-mb-2014 queries



[query_id]    [query]    [time]    [tweet_time]    [description]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

docs

253M docs

Inherits docs from tweets2013-ia

Language: multiple/other/unknown

Document type:

TweetDoc: (namedtuple)

doc_id: str
text: str
user_id: str
created_at: str
lang: str
reply_doc_id: str
retweet_doc_id: str
source: bytes
source_content_type: str

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("tweets2013-ia/trec-mb-2014")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, user_id, created_at, lang, reply_doc_id, retweet_doc_id, source, source_content_type>

You can find more details about the Python API here.

CLI

ir_datasets export tweets2013-ia/trec-mb-2014 docs



[doc_id]    [text]    [user_id]    [created_at]    [lang]    [reply_doc_id]    [retweet_doc_id]    [source]    [source_content_type]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels

58K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	not relevant	`47K`	81.6%
1	relevant	`4.8K`	8.2%
2	highly relevant	`5.9K`	10.2%

Examples:

Python API

import ir_datasets
dataset = ir_datasets.load("tweets2013-ia/trec-mb-2014")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
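Because the judgments are graded (0, 1, 2), a metric needs a threshold for what counts as relevant. The sketch below computes precision at k for one query under two assumptions of this example: unjudged documents count as not relevant, and the default threshold is `min_rel=1`. `precision_at_k` is an illustrative helper, not part of ir_datasets, and the ranking and judgments are synthetic.

```python
def precision_at_k(ranked_doc_ids, judgments, k, min_rel=1):
    """Fraction of the top-k ranked documents judged at or above `min_rel`.

    `judgments` maps doc_id -> relevance for a single query.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_doc_ids[:k]
    # Assumption: documents without a judgment are treated as not relevant (0).
    hits = sum(1 for doc_id in top_k if judgments.get(doc_id, 0) >= min_rel)
    return hits / k


# Synthetic judgments and ranking for one query (not real data).
judgments = {"a": 2, "b": 0, "c": 1}
ranking = ["a", "b", "x", "c"]

print(precision_at_k(ranking, judgments, k=4))             # 0.5
print(precision_at_k(ranking, judgments, k=4, min_rel=2))  # 0.25
```

Setting `min_rel=2` restricts the count to documents judged highly relevant.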

CLI

ir_datasets export tweets2013-ia/trec-mb-2014 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

Citation

ir_datasets.bib:

\cite{Lin2014Microblog,Sequiera2017TweetsIA}

Bibtex:

@inproceedings{Lin2014Microblog, title={Overview of the TREC-2014 Microblog Track}, author={Jimmy Lin and Miles Efron and Yulu Wang and Garrick Sherman}, booktitle={TREC}, year={2014} } @inproceedings{Sequiera2017TweetsIA, title={Finally, a Downloadable Test Collection of Tweets}, author={Royal Sequiera and Jimmy Lin}, booktitle={SIGIR}, year={2017} }

Metadata

{
  "docs": {
    "count": 252713133,
    "fields": {
      "doc_id": {
        "max_len": 18,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 55
  },
  "qrels": {
    "count": 57985,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 47340,
          "2": 5892,
          "1": 4753
        }
      }
    }
  }
}
