Github: datasets/wapo.py

ir_datasets: Washington Post

Index
  1. wapo
  2. wapo/v2
  3. wapo/v2/trec-core-2018
  4. wapo/v2/trec-news-2018
  5. wapo/v2/trec-news-2019
  6. wapo/v3/trec-news-2020

Data Access Information

To use this dataset, you need a copy of the Washington Post Collection, provided by NIST.

Your organization may already have a copy. If this is the case, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" with NIST. It can take some time to process, but you will end up with a password-protected download link.

The source file required is WashingtonPost.v2.tar.gz.

ir_datasets expects the above file to be copied/linked under ~/.ir_datasets/wapo/WashingtonPost.v2.tar.gz.
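
One way to put the archive in place is sketched below; the ~/Downloads source path is an assumption, so adjust it to wherever your NIST download landed.

from pathlib import Path

src = Path.home() / "Downloads" / "WashingtonPost.v2.tar.gz"  # assumed download location
dest_dir = Path.home() / ".ir_datasets" / "wapo"
dest_dir.mkdir(parents=True, exist_ok=True)

dest = dest_dir / "WashingtonPost.v2.tar.gz"
if not dest.exists():
    dest.symlink_to(src)  # linking avoids a second copy; copying the file there also works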


"wapo"

The Washington Post collection.


"wapo/v2"

Version 2 of the Washington Post collection, consisting of articles published between 2012 and 2017.

The collection can be obtained by requesting it from NIST here.

body contains all body text in plain text format, including paragraphs and multimedia captions. body_paras_html contains only the source paragraphs and retains HTML markup. body_media contains images, videos, tweets, and galleries, each with a link to the content and a textual caption.
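
For example, the nested media entries can be inspected like any other namedtuple field (a minimal sketch using the standard docs_iter() API):

import ir_datasets
dataset = ir_datasets.load("wapo/v2")
for doc in dataset.docs_iter():
    print(doc.title)
    for media in doc.body_media:  # each entry is a WapoDocMedia namedtuple
        print(media.type, media.url, media.text)
    break  # only inspect the first article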

docs

Language: en

Document type:
WapoDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. author: str
  5. published_date: int
  6. kicker: str
  7. body: str
  8. body_paras_html: Tuple[str, ...]
  9. body_media: Tuple[
    WapoDocMedia: (namedtuple)
    1. type: str
    2. url: str
    3. text: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("wapo/v2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>

You can find more details about the Python API here.

CLI
ir_datasets export wapo/v2 docs
[doc_id]    [url]    [title]    [author]    [published_date]    [kicker]    [body]    [body_paras_html]    [body_media]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2')
# Index wapo/v2
indexer = pt.IterDictIndexer('./indices/wapo_v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'author', 'kicker', 'body'])

You can find more details about PyTerrier indexing here.


"wapo/v2/trec-core-2018"

The TREC Common Core 2018 benchmark.

  • Queries: TREC-style (keyword, description, narrative)
  • Relevance: Deeply annotated
  • Shared Task Website

queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-core-2018")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI
ir_datasets export wapo/v2/trec-core-2018 queries
[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-core-2018')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from wapo/v2

Document type:
WapoDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. author: str
  5. published_date: int
  6. kicker: str
  7. body: str
  8. body_paras_html: Tuple[str, ...]
  9. body_media: Tuple[
    WapoDocMedia: (namedtuple)
    1. type: str
    2. url: str
    3. text: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-core-2018")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>

You can find more details about the Python API here.

CLI
ir_datasets export wapo/v2/trec-core-2018 docs
[doc_id]    [url]    [title]    [author]    [published_date]    [kicker]    [body]    [body_paras_html]    [body_media]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-core-2018')
# Index wapo/v2
indexer = pt.IterDictIndexer('./indices/wapo_v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'author', 'kicker', 'body'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
 0    not relevant
 1    relevant
 2    highly relevant

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-core-2018")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
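
Building on the iterator above, a minimal sketch that tallies judgments per relevance level and collects the judged-relevant documents for each query (treating levels 1 and 2 as positive):

import ir_datasets
from collections import Counter, defaultdict

dataset = ir_datasets.load("wapo/v2/trec-core-2018")
level_counts = Counter()
relevant_docs = defaultdict(set)  # query_id -> judged-relevant doc_ids
for qrel in dataset.qrels_iter():
    level_counts[qrel.relevance] += 1
    if qrel.relevance >= 1:
        relevant_docs[qrel.query_id].add(qrel.doc_id)
print(level_counts)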

CLI
ir_datasets export wapo/v2/trec-core-2018 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-core-2018')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.


"wapo/v2/trec-news-2018"

The TREC News 2018 Background Linking task. The task is to find relevant background information for the provided articles.

queries

Language: en

Query type:
TrecBackgroundLinkingQuery: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2018")
for query in dataset.queries_iter():
    query # namedtuple<query_id, doc_id, url>

You can find more details about the Python API here.
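
Since each query simply points at a source article by doc_id, the full article can be looked up through the docs store; a minimal sketch (this requires the wapo/v2 source file to be configured as described above):

import ir_datasets

dataset = ir_datasets.load("wapo/v2/trec-news-2018")
docs_store = dataset.docs_store()  # random access to wapo/v2 documents by doc_id
for query in dataset.queries_iter():
    article = docs_store.get(query.doc_id)  # the article to find background links for
    print(query.query_id, article.title)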

CLI
ir_datasets export wapo/v2/trec-news-2018 queries
[query_id]    [doc_id]    [url]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2018')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('doc_id'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from wapo/v2

Document type:
WapoDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. author: str
  5. published_date: int
  6. kicker: str
  7. body: str
  8. body_paras_html: Tuple[str, ...]
  9. body_media: Tuple[
    WapoDocMedia: (namedtuple)
    1. type: str
    2. url: str
    3. text: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2018")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>

You can find more details about the Python API here.

CLI
ir_datasets export wapo/v2/trec-news-2018 docs
[doc_id]    [url]    [title]    [author]    [published_date]    [kicker]    [body]    [body_paras_html]    [body_media]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2018')
# Index wapo/v2
indexer = pt.IterDictIndexer('./indices/wapo_v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'author', 'kicker', 'body'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
 0    The document provides little or no useful background information.
 2    The document provides some useful background or contextual information that would help the user understand the broader story context of the target article.
 4    The document provides significantly useful background ...
 8    The document provides essential useful background ...
16    The document must appear in the sidebar; otherwise, critical context is missing.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2018")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export wapo/v2/trec-news-2018 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2018')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('doc_id'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation
bibtex:
@inproceedings{Soboroff2018News,
  title={TREC 2018 News Track Overview},
  author={Ian Soboroff and Shudong Huang and Donna Harman},
  booktitle={TREC},
  year={2018}
}

"wapo/v2/trec-news-2019"

The TREC News 2019 Background Linking task. The task is to find relevant background information for the provided articles.

queries

Language: en

Query type:
TrecBackgroundLinkingQuery: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, doc_id, url>

You can find more details about the Python API here.

CLI
ir_datasets export wapo/v2/trec-news-2019 queries
[query_id]    [doc_id]    [url]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2019')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('doc_id'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from wapo/v2

Document type:
WapoDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. author: str
  5. published_date: int
  6. kicker: str
  7. body: str
  8. body_paras_html: Tuple[str, ...]
  9. body_media: Tuple[
    WapoDocMedia: (namedtuple)
    1. type: str
    2. url: str
    3. text: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2019")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>

You can find more details about the Python API here.

CLI
ir_datasets export wapo/v2/trec-news-2019 docs
[doc_id]    [url]    [title]    [author]    [published_date]    [kicker]    [body]    [body_paras_html]    [body_media]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2019')
# Index wapo/v2
indexer = pt.IterDictIndexer('./indices/wapo_v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'author', 'kicker', 'body'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
 0    The document provides little or no useful background information.
 2    The document provides some useful background or contextual information that would help the user understand the broader story context of the target article.
 4    The document provides significantly useful background ...
 8    The document provides essential useful background ...
16    The document must appear in the sidebar; otherwise, critical context is missing.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2019")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export wapo/v2/trec-news-2019 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2019')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('doc_id'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation
bibtex:
@inproceedings{Soboroff2019News,
  title={TREC 2019 News Track Overview},
  author={Ian Soboroff and Shudong Huang and Donna Harman},
  booktitle={TREC},
  year={2019}
}

"wapo/v3/trec-news-2020"

The TREC News 2020 Background Linking task. The task is to find relevant background information for the provided articles.

If you have a copy of the v3 dataset, we would appreciate a pull request to add support!

queries

Language: en

Query type:
TrecBackgroundLinkingQuery: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("wapo/v3/trec-news-2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, doc_id, url>

You can find more details about the Python API here.

CLI
ir_datasets export wapo/v3/trec-news-2020 queries
[query_id]    [doc_id]    [url]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
 0    The document provides little or no useful background information.
 2    The document provides some useful background or contextual information that would help the user understand the broader story context of the target article.
 4    The document provides significantly useful background ...
 8    The document provides essential useful background ...
16    The document must appear in the sidebar; otherwise, critical context is missing.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("wapo/v3/trec-news-2020")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export wapo/v3/trec-news-2020 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier