ir_datasets: Washington Post

To use this dataset, you need a copy of the Washington Post Collection, provided by NIST.
Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" with NIST. Processing can take some time, but you will end up with a password-protected download link.
The source file required is WashingtonPost.v2.tar.gz.
ir_datasets expects the above file to be copied/linked under ~/.ir_datasets/wapo/WashingtonPost.v2.tar.gz.
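The archive can be copied or linked into place from a shell; the source path below is a placeholder for wherever your NIST download landed:

```shell
# Placeholder path -- substitute the actual location of your download:
SRC="$HOME/Downloads/WashingtonPost.v2.tar.gz"
mkdir -p ~/.ir_datasets/wapo
# Symlinking avoids duplicating the archive; a plain cp works too.
ln -sf "$SRC" ~/.ir_datasets/wapo/WashingtonPost.v2.tar.gz
```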
The Washington Post collection.
Version 2 of the Washington Post collection, consisting of articles published from 2012 to 2017.
The collection can be obtained by requesting it from NIST here.
body contains all body text in plain text format, including paragraphs and multi-media captions. body_paras_html contains only the source paragraphs, with HTML markup. body_media contains images, videos, tweets, and galleries, along with a link to the content and a textual caption.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>
You can find more details about the Python API here.
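The doc objects are plain namedtuples, so fields are accessed by attribute name. A standalone sketch using a hand-made stand-in (so it runs without the collection; all values below are invented):

```python
from collections import namedtuple

# Stand-in mirroring the wapo/v2 doc fields listed above; the values are invented.
WapoDoc = namedtuple('WapoDoc', ['doc_id', 'url', 'title', 'author', 'published_date',
                                 'kicker', 'body', 'body_paras_html', 'body_media'])
doc = WapoDoc('doc1', 'https://example.com/article', 'Example title', 'Example Author',
              0, 'Opinions', 'Plain-text body.', ['<p>Plain-text body.</p>'], [])

print(doc.title)    # fields are accessed by name
print(doc._fields)  # namedtuples also expose the full field list
```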
ir_datasets export wapo/v2 docs
[doc_id] [url] [title] [author] [published_date] [kicker] [body] [body_paras_html] [body_media]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2')
# Index wapo/v2
indexer = pt.IterDictIndexer('./indices/wapo_v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'author', 'kicker', 'body'])
You can find more details about PyTerrier indexing here.
The TREC Common Core 2018 benchmark.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-core-2018")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-core-2018 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-core-2018')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from wapo/v2
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-core-2018")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-core-2018 docs
[doc_id] [url] [title] [author] [published_date] [kicker] [body] [body_paras_html] [body_media]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-core-2018')
# Index wapo/v2
indexer = pt.IterDictIndexer('./indices/wapo_v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'author', 'kicker', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | not relevant |
1 | relevant |
2 | highly relevant |
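Measures that expect binary judgments (e.g. MAP) conventionally count a document as relevant when its graded relevance is 1 or higher; a minimal sketch of that binarization (the doc IDs and grades are invented):

```python
# Invented graded judgments for one query, keyed by doc_id.
graded = {'doc_a': 2, 'doc_b': 1, 'doc_c': 0}

# Binary convention: relevant iff relevance >= 1.
binary = {doc_id: int(rel >= 1) for doc_id, rel in graded.items()}
print(binary)  # {'doc_a': 1, 'doc_b': 1, 'doc_c': 0}
```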
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-core-2018")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
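A common pattern is to collect the qrels into a {query_id: {doc_id: relevance}} mapping. The sketch below substitutes a hand-made namedtuple list for dataset.qrels_iter() so it runs without the collection:

```python
from collections import defaultdict, namedtuple

# Stand-in for dataset.qrels_iter(); the values are invented.
Qrel = namedtuple('Qrel', ['query_id', 'doc_id', 'relevance', 'iteration'])
qrels_iter = [Qrel('321', 'doc_a', 2, '0'), Qrel('321', 'doc_b', 0, '0')]

# Nested lookup: qrels[query_id][doc_id] -> relevance.
qrels = defaultdict(dict)
for qrel in qrels_iter:
    qrels[qrel.query_id][qrel.doc_id] = qrel.relevance

print(dict(qrels))  # {'321': {'doc_a': 2, 'doc_b': 0}}
```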
ir_datasets export wapo/v2/trec-core-2018 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-core-2018')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
The TREC News 2018 Background Linking task. The task is to find relevant background information for the provided articles.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2018")
for query in dataset.queries_iter():
    query # namedtuple<query_id, doc_id, url>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-news-2018 queries
[query_id] [doc_id] [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2018')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('doc_id'))
You can find more details about PyTerrier retrieval here.
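Background-linking queries identify an article rather than carrying query text, so a query string must be derived from the article itself. One simple baseline (an illustration only, not the track's prescribed method) is to use the article body's most frequent non-stopword terms:

```python
from collections import Counter
import re

# Toy stopword list, for illustration only.
STOP = frozenset({'the', 'a', 'an', 'of', 'and', 'to', 'in', 'on', 'as'})

def top_terms(text, k=5):
    """Return the k most frequent non-stopword terms of an article body."""
    tokens = re.findall(r'[a-z]+', text.lower())
    counts = Counter(t for t in tokens if t not in STOP)
    return [term for term, _ in counts.most_common(k)]

text = "The sanctions bill passed. Sanctions on trade stalled as trade talks continued."
print(top_terms(text, 2))  # ['sanctions', 'trade']
```

A query built this way can be passed to the BM25 pipeline above in place of the raw topic doc_id.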
Language: en
Note: Uses docs from wapo/v2
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2018")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-news-2018 docs
[doc_id] [url] [title] [author] [published_date] [kicker] [body] [body_paras_html] [body_media]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2018')
# Index wapo/v2
indexer = pt.IterDictIndexer('./indices/wapo_v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'author', 'kicker', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | The document provides little or no useful background information. |
2 | The document provides some useful background or contextual information that would help the user understand the broader story context of the target article. |
4 | The document provides significantly useful background ... |
8 | The document provides essential useful background ... |
16 | The document _must_ appear in the sidebar otherwise critical context is missing. |
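The 0/2/4/8/16 scale is exponential, so feeding the raw values into nDCG weights the top level very heavily. One hypothetical remapping to evenly spaced gains (an assumption here, not the official track gain function) is the base-2 logarithm:

```python
import math

# Hypothetical gain remapping for the exponential 0/2/4/8/16 scale.
def gain(rel):
    return int(math.log2(rel)) if rel > 0 else 0

print([gain(r) for r in (0, 2, 4, 8, 16)])  # [0, 1, 2, 3, 4]
```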
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2018")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-news-2018 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2018')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('doc_id'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
The TREC News 2019 Background Linking task. The task is to find relevant background information for the provided articles.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, doc_id, url>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-news-2019 queries
[query_id] [doc_id] [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2019')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('doc_id'))
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from wapo/v2
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2019")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-news-2019 docs
[doc_id] [url] [title] [author] [published_date] [kicker] [body] [body_paras_html] [body_media]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2019')
# Index wapo/v2
indexer = pt.IterDictIndexer('./indices/wapo_v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'author', 'kicker', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | The document provides little or no useful background information. |
2 | The document provides some useful background or contextual information that would help the user understand the broader story context of the target article. |
4 | The document provides significantly useful background ... |
8 | The document provides essential useful background ... |
16 | The document _must_ appear in the sidebar otherwise critical context is missing. |
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2019")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-news-2019 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2019')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('doc_id'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
The TREC News 2020 Background Linking task. The task is to find relevant background information for the provided articles.
If you have a copy of the v3 dataset, we would appreciate a pull request to add support!
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v3/trec-news-2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, doc_id, url>
You can find more details about the Python API here.
ir_datasets export wapo/v3/trec-news-2020 queries
[query_id] [doc_id] [url]
...
You can find more details about the CLI here.
No example available for PyTerrier
Relevance levels
Rel. | Definition |
---|---|
0 | The document provides little or no useful background information. |
2 | The document provides some useful background or contextual information that would help the user understand the broader story context of the target article. |
4 | The document provides significantly useful background ... |
8 | The document provides essential useful background ... |
16 | The document _must_ appear in the sidebar otherwise critical context is missing. |
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v3/trec-news-2020")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wapo/v3/trec-news-2020 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier