GitHub: datasets/tweets2013_ia.py

ir_datasets: Tweets 2013 (Internet Archive)

Index
  1. tweets2013-ia
  2. tweets2013-ia/trec-mb-2013
  3. tweets2013-ia/trec-mb-2014

"tweets2013-ia"

A collection of tweets from a 2-month window archived by the Internet Archive. This collection can serve as a stand-in document collection for the TREC Microblog 2013-14 tasks. (Although it is not exactly the same collection, Sequiera and Lin show that it is close enough.)

This collection is downloaded automatically from the Internet Archive. Download speeds are often slow, so this can take some time. During the download, ir_datasets constructs a new directory hierarchy to facilitate fast lookups and slices.
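
The lookups and slices mentioned above might be used roughly as follows. This is a minimal sketch: peek_docs, lookup_doc, and demo are hypothetical helper names, and it assumes this dataset exposes the usual ir_datasets slicing on docs_iter() and a docs_store() with a get method.

```python
def peek_docs(dataset, n=5):
    """Return the first n documents by slicing the docs iterator."""
    # Assumption: docs_iter() returns a sliceable iterator for this dataset.
    return list(dataset.docs_iter()[:n])


def lookup_doc(dataset, doc_id):
    """Fetch a single document by its ID through the document store."""
    # Assumption: docs_store().get(doc_id) is available for this dataset.
    return dataset.docs_store().get(doc_id)


def demo():
    # Not called automatically: the first access triggers the slow download.
    import ir_datasets
    dataset = ir_datasets.load('tweets2013-ia')
    first = peek_docs(dataset, 3)
    for doc in first:
        print(doc.doc_id, doc.text)
    # Look the first document up again by ID.
    print(lookup_doc(dataset, first[0].doc_id))
```

Calling demo() runs the sketch against the real collection; the two helpers only rely on the dataset object passed in.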

docs

Language: multiple/other/unknown

Document type:
TweetDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. user_id: str
  4. created_at: str
  5. lang: str
  6. reply_doc_id: str
  7. retweet_doc_id: str
  8. source: bytes
  9. source_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('tweets2013-ia')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, user_id, created_at, lang, reply_doc_id, retweet_doc_id, source, source_content_type>
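
As an illustration of how the TweetDoc fields could be used, the sketch below keeps only tweets in one language that are not retweets. The namedtuple is a local stand-in with the same field names so the code runs without the download, the sample values are made up, and treating an empty or None retweet_doc_id as "not a retweet" is an assumption rather than documented behavior.

```python
from collections import namedtuple

# Local stand-in mirroring the TweetDoc fields listed above.
TweetDoc = namedtuple('TweetDoc', [
    'doc_id', 'text', 'user_id', 'created_at', 'lang',
    'reply_doc_id', 'retweet_doc_id', 'source', 'source_content_type',
])


def original_tweets(docs, lang='en'):
    """Yield docs in the given language that do not look like retweets."""
    for doc in docs:
        # Assumption: a falsy retweet_doc_id means the tweet is not a retweet.
        if doc.lang == lang and not doc.retweet_doc_id:
            yield doc


# Made-up sample records, for illustration only.
sample = [
    TweetDoc('1', 'hello world', 'u1', 'n/a', 'en', None, None, b'{}', 'application/json'),
    TweetDoc('2', 'RT hello world', 'u2', 'n/a', 'en', None, '1', b'{}', 'application/json'),
    TweetDoc('3', 'hola mundo', 'u3', 'n/a', 'es', None, None, b'{}', 'application/json'),
]
kept = [doc.doc_id for doc in original_tweets(sample)]
```

With the real collection, dataset.docs_iter() can be passed in place of sample.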

Citation

bibtex:
@inproceedings{Sequiera2017Finally,
  title={Finally, a Downloadable Test Collection of Tweets},
  author={Royal Sequiera and Jimmy Lin},
  booktitle={SIGIR},
  year={2017}
}

"tweets2013-ia/trec-mb-2013"

TREC Microblog 2013 test collection.

queries

Language: en

Query type:
TrecMb13Query: (namedtuple)
  1. query_id: str
  2. query: str
  3. time: str
  4. tweet_time: str

Example

import ir_datasets
dataset = ir_datasets.load('tweets2013-ia/trec-mb-2013')
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, time, tweet_time>
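
The time and tweet_time fields can be used to restrict results to tweets posted before the query was issued. The sketch below assumes that tweet_time holds the ID of the most recent tweet at query time (the TREC Microblog convention) and that tweet IDs increase over time. Both are assumptions not stated on this page, and the query record is made up.

```python
from collections import namedtuple

# Local stand-in mirroring the TrecMb13Query fields listed above.
TrecMb13Query = namedtuple('TrecMb13Query', ['query_id', 'query', 'time', 'tweet_time'])


def posted_before_query(doc_ids, query):
    """Keep doc IDs that are numerically no greater than query.tweet_time."""
    # Assumption: tweet_time is a tweet ID marking the query's point in time.
    cutoff = int(query.tweet_time)
    return [doc_id for doc_id in doc_ids if int(doc_id) <= cutoff]


# Made-up query and candidate IDs, for illustration only.
query = TrecMb13Query('q1', 'example topic', 'n/a', '1000')
allowed = posted_before_query(['900', '1000', '1001'], query)
```

IDs are compared as integers because comparing them as strings would order IDs of different lengths incorrectly.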

docs

Language: multiple/other/unknown

Document type:
TweetDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. user_id: str
  4. created_at: str
  5. lang: str
  6. reply_doc_id: str
  7. retweet_doc_id: str
  8. source: bytes
  9. source_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('tweets2013-ia/trec-mb-2013')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, user_id, created_at, lang, reply_doc_id, retweet_doc_id, source, source_content_type>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     not relevant
1     relevant
2     highly relevant

Example

import ir_datasets
dataset = ir_datasets.load('tweets2013-ia/trec-mb-2013')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
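
For evaluation it is often convenient to index the judgments by query and to see how they spread across the three relevance levels. The following is a minimal sketch: build_qrels and level_counts are hypothetical helper names, the namedtuple is a local stand-in for TrecQrel, and the sample judgments are made up.

```python
from collections import Counter, defaultdict, namedtuple

# Local stand-in mirroring the TrecQrel fields listed above.
TrecQrel = namedtuple('TrecQrel', ['query_id', 'doc_id', 'relevance', 'iteration'])


def build_qrels(qrels):
    """Return {query_id: {doc_id: relevance}} for quick lookups."""
    by_query = defaultdict(dict)
    for qrel in qrels:
        by_query[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(by_query)


def level_counts(qrels):
    """Count how many judgments fall into each relevance level."""
    return Counter(qrel.relevance for qrel in qrels)


# Made-up judgments, for illustration only.
sample = [
    TrecQrel('q1', 'd1', 0, '0'),
    TrecQrel('q1', 'd2', 2, '0'),
    TrecQrel('q2', 'd1', 1, '0'),
    TrecQrel('q2', 'd3', 0, '0'),
]
lookup = build_qrels(sample)
counts = level_counts(sample)
```

With the real data, list(dataset.qrels_iter()) can be passed to both helpers in place of sample.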

Citation

bibtex:
@inproceedings{Lin2013Microblog,
  title={Overview of the TREC-2013 Microblog Track},
  author={Jimmy Lin and Miles Efron},
  booktitle={TREC},
  year={2013}
}

"tweets2013-ia/trec-mb-2014"

TREC Microblog 2014 test collection.

queries

Language: en

Query type:
TrecMb14Query: (namedtuple)
  1. query_id: str
  2. query: str
  3. time: str
  4. tweet_time: str
  5. description: str

Example

import ir_datasets
dataset = ir_datasets.load('tweets2013-ia/trec-mb-2014')
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, time, tweet_time, description>

docs

Language: multiple/other/unknown

Document type:
TweetDoc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. user_id: str
  4. created_at: str
  5. lang: str
  6. reply_doc_id: str
  7. retweet_doc_id: str
  8. source: bytes
  9. source_content_type: str

Example

import ir_datasets
dataset = ir_datasets.load('tweets2013-ia/trec-mb-2014')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, user_id, created_at, lang, reply_doc_id, retweet_doc_id, source, source_content_type>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     not relevant
1     relevant
2     highly relevant

Example

import ir_datasets
dataset = ir_datasets.load('tweets2013-ia/trec-mb-2014')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
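
The graded relevance levels can be turned into a binary decision by choosing a threshold. The sketch below computes precision at a cutoff k for one query under that assumption. precision_at_k is a hypothetical helper, unjudged documents are counted as not relevant (a common convention, assumed here), and the ranking and judgments are made up.

```python
def precision_at_k(ranking, judged, k=30, min_rel=1):
    """Fraction of the top-k doc IDs whose relevance is at least min_rel.

    ranking: doc IDs ordered best first.
    judged:  {doc_id: relevance} for a single query.
    """
    if k <= 0:
        raise ValueError('k must be positive')
    top = ranking[:k]
    # Unjudged documents default to relevance 0 (assumption).
    hits = sum(1 for doc_id in top if judged.get(doc_id, 0) >= min_rel)
    return hits / k


# Made-up judgments and ranking, for illustration only.
judged = {'d1': 2, 'd2': 0, 'd3': 1}
ranking = ['d1', 'd2', 'd4', 'd3']
p_relevant = precision_at_k(ranking, judged, k=4, min_rel=1)
p_highly = precision_at_k(ranking, judged, k=4, min_rel=2)
```

Setting min_rel=2 counts only "highly relevant" tweets as hits, while min_rel=1 also accepts "relevant" ones.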

Citation

bibtex:
@inproceedings{Lin2014Microblog,
  title={Overview of the TREC-2014 Microblog Track},
  author={Jimmy Lin and Miles Efron and Yulu Wang and Garrick Sherman},
  booktitle={TREC},
  year={2014}
}