ir_datasets: Tweets 2013 (Internet Archive)
A collection of tweets from a 2-month window archived by the Internet Archive. This collection can serve as a stand-in document collection for the TREC Microblog 2013-14 tasks. (Although it is not exactly the same collection, Sequiera and Lin show that it is close enough.)
This collection is downloaded automatically from the Internet Archive. Download speeds are often slow, so this can take some time. During the download, ir_datasets constructs a new directory hierarchy to facilitate fast lookups and slices.
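As a hedged illustration of the lookups and slices mentioned above, the sketch below uses `docs_store()` and slicing on `docs_iter()` from the ir_datasets Python API. The helper names `first_docs` and `lookup_doc` are made up for this example, and it assumes that slicing is supported for this collection. Calling either helper triggers the (slow) download on first use.

```python
def first_docs(dataset_id="tweets2013-ia", n=5):
    """Return the first n documents by slicing docs_iter()."""
    # Imported lazily so that defining the helper does not require the package.
    import ir_datasets
    dataset = ir_datasets.load(dataset_id)
    # Assumption: docs_iter() supports slice syntax for this collection.
    return list(dataset.docs_iter()[:n])


def lookup_doc(doc_id, dataset_id="tweets2013-ia"):
    """Fetch a single tweet by its doc_id through the document store."""
    import ir_datasets
    dataset = ir_datasets.load(dataset_id)
    return dataset.docs_store().get(doc_id)
```

Neither helper is called here, so nothing is downloaded until you invoke one yourself, for example `first_docs(n=3)`.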
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("tweets2013-ia")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, user_id, created_at, lang, reply_doc_id, retweet_doc_id, source, source_content_type>
You can find more details about the Python API here.
ir_datasets export tweets2013-ia docs
[doc_id] [text] [user_id] [created_at] [lang] [reply_doc_id] [retweet_doc_id] [source] [source_content_type]
...
You can find more details about the CLI here.
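If the exported lines are consumed by another script, they can be read back into records. The following is a minimal sketch, assuming tab-separated output with the nine fields in the order shown above. The `TweetDoc` type, the `parse_export` helper, and the sample row are all made up for illustration.

```python
import io
from collections import namedtuple

# Field order taken from the export line layout shown above.
TweetDoc = namedtuple("TweetDoc", [
    "doc_id", "text", "user_id", "created_at", "lang",
    "reply_doc_id", "retweet_doc_id", "source", "source_content_type",
])


def parse_export(stream):
    """Yield one TweetDoc per tab-separated line, skipping malformed lines."""
    for line in stream:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == len(TweetDoc._fields):
            yield TweetDoc(*fields)


# Made-up placeholder row, not real data from the collection.
sample = "1\thello world\t42\tsome date\ten\t\t\t<raw source>\tapplication/json\n"
docs = list(parse_export(io.StringIO(sample)))
print(docs[0].doc_id, docs[0].lang)
```

In practice the stream would be a file holding the export output, or `sys.stdin` when piping from the CLI.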
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.tweets2013-ia')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Bibtex:
@inproceedings{Sequiera2017TweetsIA, title={Finally, a Downloadable Test Collection of Tweets}, author={Royal Sequiera and Jimmy Lin}, booktitle={SIGIR}, year={2017} }
Metadata:
{ "docs": { "count": 252713133, "fields": { "doc_id": { "max_len": 18, "common_prefix": "" } } } }
TREC Microblog 2013 test collection.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tweets2013-ia/trec-mb-2013")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, time, tweet_time>
You can find more details about the Python API here.
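The `tweet_time` field is useful for restricting results to tweets that existed when the query was issued. The sketch below assumes (this page does not state it) that `tweet_time` holds the ID of the most recent tweet at query time and that tweet IDs increase over time. The `Query` and `Doc` stand-ins, the `visible_at_query_time` helper, and all values are made up.

```python
from collections import namedtuple

# Stand-ins mirroring the field names shown above; the values are made up.
Query = namedtuple("Query", ["query_id", "query", "time", "tweet_time"])
Doc = namedtuple("Doc", ["doc_id", "text"])


def visible_at_query_time(query, docs):
    """Keep docs whose numeric ID does not exceed the query's tweet_time."""
    cutoff = int(query.tweet_time)
    return [d for d in docs if int(d.doc_id) <= cutoff]


query = Query("q1", "example topic", "made-up timestamp", "300")
docs = [
    Doc("100", "earlier tweet"),
    Doc("300", "boundary tweet"),
    Doc("500", "later tweet"),
]
kept = visible_at_query_time(query, docs)
print([d.doc_id for d in kept])  # ['100', '300']
```

Comparing IDs as integers rather than strings avoids ordering errors when IDs differ in length.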
ir_datasets export tweets2013-ia/trec-mb-2013 queries
[query_id] [query] [time] [tweet_time]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.tweets2013-ia.trec-mb-2013.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from tweets2013-ia
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("tweets2013-ia/trec-mb-2013")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, user_id, created_at, lang, reply_doc_id, retweet_doc_id, source, source_content_type>
You can find more details about the Python API here.
ir_datasets export tweets2013-ia/trec-mb-2013 docs
[doc_id] [text] [user_id] [created_at] [lang] [reply_doc_id] [retweet_doc_id] [source] [source_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.tweets2013-ia.trec-mb-2013')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 62K | 87.4% |
1 | relevant | 5.9K | 8.2% |
2 | highly relevant | 3.2K | 4.4% |
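The percentages in this table follow directly from the per-level judgment counts listed in this page's metadata for trec-mb-2013. A minimal sketch of that arithmetic, with `relevance_shares` as a made-up helper name:

```python
# Counts per relevance level, copied from this page's metadata for trec-mb-2013.
counts_by_value = {0: 62268, 1: 5856, 2: 3155}


def relevance_shares(counts):
    """Return {level: percentage of all judgments}, rounded to one decimal."""
    total = sum(counts.values())
    return {level: round(100 * n / total, 1) for level, n in sorted(counts.items())}


shares = relevance_shares(counts_by_value)
print(sum(counts_by_value.values()))  # 71279
print(shares)  # {0: 87.4, 1: 8.2, 2: 4.4}
```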
Examples:
import ir_datasets
dataset = ir_datasets.load("tweets2013-ia/trec-mb-2013")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export tweets2013-ia/trec-mb-2013 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.tweets2013-ia.trec-mb-2013.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Lin2013Microblog, title={Overview of the TREC-2013 Microblog Track}, author={Jimmy Lin and Miles Efron}, booktitle={TREC}, year={2013} }
@inproceedings{Sequiera2017TweetsIA, title={Finally, a Downloadable Test Collection of Tweets}, author={Royal Sequiera and Jimmy Lin}, booktitle={SIGIR}, year={2017} }
Metadata:
{ "docs": { "count": 252713133, "fields": { "doc_id": { "max_len": 18, "common_prefix": "" } } }, "queries": { "count": 60 }, "qrels": { "count": 71279, "fields": { "relevance": { "counts_by_value": { "0": 62268, "1": 5856, "2": 3155 } } } } }
TREC Microblog 2014 test collection.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("tweets2013-ia/trec-mb-2014")
for query in dataset.queries_iter():
    query # namedtuple<query_id, query, time, tweet_time, description>
You can find more details about the Python API here.
ir_datasets export tweets2013-ia/trec-mb-2014 queries
[query_id] [query] [time] [tweet_time] [description]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.tweets2013-ia.trec-mb-2014.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from tweets2013-ia
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("tweets2013-ia/trec-mb-2014")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, user_id, created_at, lang, reply_doc_id, retweet_doc_id, source, source_content_type>
You can find more details about the Python API here.
ir_datasets export tweets2013-ia/trec-mb-2014 docs
[doc_id] [text] [user_id] [created_at] [lang] [reply_doc_id] [retweet_doc_id] [source] [source_content_type]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.tweets2013-ia.trec-mb-2014')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 47K | 81.6% |
1 | relevant | 4.8K | 8.2% |
2 | highly relevant | 5.9K | 10.2% |
Examples:
import ir_datasets
dataset = ir_datasets.load("tweets2013-ia/trec-mb-2014")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
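Evaluation code often wants the judgments grouped per query rather than as a flat stream. The sketch below shows one way to do that over records shaped like the namedtuple above. The `Qrel` stand-in, the `group_qrels` helper, its `min_relevance` parameter, and the sample rows are made up for illustration. With a loaded dataset, `dataset.qrels_iter()` could be passed in place of `rows`.

```python
from collections import defaultdict, namedtuple

# Stand-in mirroring the qrel fields shown above; the rows below are made up.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def group_qrels(qrels, min_relevance=1):
    """Map query_id -> {doc_id: relevance}, keeping judgments >= min_relevance."""
    grouped = defaultdict(dict)
    for q in qrels:
        if q.relevance >= min_relevance:
            grouped[q.query_id][q.doc_id] = q.relevance
    return dict(grouped)


rows = [
    Qrel("q1", "d1", 0, "Q0"),
    Qrel("q1", "d2", 2, "Q0"),
    Qrel("q2", "d3", 1, "Q0"),
]
print(group_qrels(rows))  # {'q1': {'d2': 2}, 'q2': {'d3': 1}}
```

Setting `min_relevance=2` keeps only the "highly relevant" level from the table above, and `min_relevance=0` keeps every judgment.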
ir_datasets export tweets2013-ia/trec-mb-2014 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.tweets2013-ia.trec-mb-2014.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@inproceedings{Lin2014Microblog, title={Overview of the TREC-2014 Microblog Track}, author={Jimmy Lin and Miles Efron and Yulu Wang and Garrick Sherman}, booktitle={TREC}, year={2014} }
@inproceedings{Sequiera2017TweetsIA, title={Finally, a Downloadable Test Collection of Tweets}, author={Royal Sequiera and Jimmy Lin}, booktitle={SIGIR}, year={2017} }
Metadata:
{ "docs": { "count": 252713133, "fields": { "doc_id": { "max_len": 18, "common_prefix": "" } } }, "queries": { "count": 55 }, "qrels": { "count": 57985, "fields": { "relevance": { "counts_by_value": { "0": 47340, "2": 5892, "1": 4753 } } } } }