Github: datasets/wapo.py

ir_datasets: Washington Post

Index
  1. wapo
  2. wapo/v2
  3. wapo/v2/trec-core-2018
  4. wapo/v2/trec-news-2018
  5. wapo/v2/trec-news-2019
  6. wapo/v3/trec-news-2020
  7. wapo/v4

Data Access Information

To use this dataset, you need a copy of the Washington Post Collection, provided by NIST.

Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" with NIST. Processing can take some time, but you will end up with a password-protected download link.

The source file required is WashingtonPost.v2.tar.gz.

ir_datasets expects the above file to be copied/linked under ~/.ir_datasets/wapo/WashingtonPost.v2.tar.gz.
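
Before loading the dataset, it can help to confirm that the source file is where ir_datasets expects it. The following is a minimal sketch using only the path stated above; the function name wapo_source_status is made up for this example, and the sketch assumes the default ~/.ir_datasets location is in use.

```python
from pathlib import Path

# Directory and file name as stated in the text above.
EXPECTED = Path.home() / ".ir_datasets" / "wapo" / "WashingtonPost.v2.tar.gz"


def wapo_source_status(path: Path = EXPECTED) -> str:
    """Return a short, human-readable status for the expected source file."""
    if path.is_file():  # follows symlinks, so a valid link counts as found
        return f"found: {path} ({path.stat().st_size} bytes)"
    if path.is_symlink():  # a link exists but its target does not
        return f"broken link: {path}"
    return f"missing: {path}"


print(wapo_source_status())
```

A "broken link" result usually means the file was linked rather than copied and the original has since moved.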


"wapo"

The Washington Post collection.


"wapo/v2"

Version 2 of the Washington Post collection, consisting of articles published between 2012 and 2017.

The collection is obtained by requesting it from NIST.

body contains all body text in plain text format, including paragraphs and multimedia captions. body_paras_html contains only the source paragraphs, with their HTML markup. body_media contains images, videos, tweets, and galleries, each with a link to the content and a textual caption.
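
To illustrate how the three body fields relate, here is a small sketch that pulls captions out of body_media. It does not need the collection: WapoDoc and WapoDocMedia below are stand-ins reduced to the fields described above, the sample values (including the media type strings "image" and "video") are invented, and media_captions is a hypothetical helper, not part of ir_datasets.

```python
from typing import List, NamedTuple, Optional, Tuple


class WapoDocMedia(NamedTuple):
    """Stand-in for a media entry: type, link, and caption."""
    type: str
    url: str
    text: str


class WapoDoc(NamedTuple):
    """Stand-in for a document, reduced to the three body fields."""
    body: str
    body_paras_html: Tuple[str, ...]
    body_media: Tuple[WapoDocMedia, ...]


def media_captions(doc, media_type: Optional[str] = None) -> List[Tuple[str, str]]:
    """Return (url, caption) pairs, optionally restricted to one media type."""
    return [
        (m.url, m.text)
        for m in doc.body_media
        if media_type is None or m.type == media_type
    ]


# Invented sample values, for illustration only.
doc = WapoDoc(
    body="First paragraph. A caption.",
    body_paras_html=("<p>First paragraph.</p>",),
    body_media=(
        WapoDocMedia(type="image", url="https://example.com/a.jpg", text="A caption."),
        WapoDocMedia(type="video", url="https://example.com/b.mp4", text="A clip."),
    ),
)

print(media_captions(doc, media_type="image"))
```

Because the helper only reads the body_media attribute, it should also work on documents yielded by docs_iter(), provided the real type strings are checked first.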

docs | Metadata
595K docs

Language: en

Document type:
WapoDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. author: str
  5. published_date: Optional[int]
  6. kicker: str
  7. body: str
  8. body_paras_html: Tuple[str, ...]
  9. body_media: Tuple[
    WapoDocMedia: (namedtuple)
    1. type: str
    2. url: str
    3. text: str
    , ...]
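
Since published_date is Optional[int], code that uses it needs to handle None. The sketch below converts the value to a timezone-aware datetime. It ASSUMES the integer is a Unix timestamp in milliseconds, which the listing above does not state; verify this against a few documents before relying on it. The function name published_datetime is made up for this example.

```python
from datetime import datetime, timezone
from typing import Optional


def published_datetime(published_date: Optional[int]) -> Optional[datetime]:
    """Convert published_date to a UTC datetime, passing None through.

    Assumption: the value is milliseconds since the Unix epoch.
    """
    if published_date is None:
        return None
    return datetime.fromtimestamp(published_date / 1000, tz=timezone.utc)


print(published_datetime(None))
print(published_datetime(1483228800000))
```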

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("wapo/v2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>

More details about the Python API are available in the ir_datasets documentation.


"wapo/v2/trec-core-2018"

The TREC Common Core 2018 benchmark.

  • Queries: TREC-style (keyword, description, narrative)
  • Relevance: Deeply-annotated
  • Shared Task Website
queries | docs | qrels | Metadata
50 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-core-2018")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
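
Each TREC-style query carries three text fields, so a retrieval run has to decide which of them to search with. The sketch below shows one way to do that; query_text is a hypothetical helper, TrecQuery here is a stand-in with the same fields as the listing above, and the sample query is invented.

```python
from typing import NamedTuple, Sequence


class TrecQuery(NamedTuple):
    """Stand-in with the four fields listed above."""
    query_id: str
    title: str
    description: str
    narrative: str


def query_text(query, fields: Sequence[str] = ("title",)) -> str:
    """Join the chosen query fields into one search string, skipping empty ones."""
    parts = (getattr(query, name).strip() for name in fields)
    return " ".join(part for part in parts if part)


# Invented sample query, for illustration only.
sample = TrecQuery(
    query_id="q1",
    title="bridge repairs",
    description="Find articles about bridge repairs. ",
    narrative="",
)

print(query_text(sample))
print(query_text(sample, fields=("title", "description", "narrative")))
```

Title-only queries are the usual keyword baseline; adding the description gives a longer, more natural-language query.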



"wapo/v2/trec-news-2018"

The TREC News 2018 Background Linking task. The task is to find relevant background information for the provided articles.

queries | docs | qrels | Citation | Metadata
50 queries

Language: en

Query type:
TrecBackgroundLinkingQuery: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2018")
for query in dataset.queries_iter():
    query # namedtuple<query_id, doc_id, url>
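
In background linking, a query is itself an article: doc_id names the document for which background reading should be found. The sketch below pairs each query with its source article through a lookup function. resolve_source_articles is a hypothetical helper, the query type is a stand-in with the fields listed above, and a plain dict with invented entries plays the role of the collection.

```python
from typing import Any, Callable, Iterable, Iterator, NamedTuple, Optional, Tuple


class TrecBackgroundLinkingQuery(NamedTuple):
    """Stand-in with the three fields listed above."""
    query_id: str
    doc_id: str
    url: str


def resolve_source_articles(
    queries: Iterable[Any],
    lookup: Callable[[str], Any],
) -> Iterator[Tuple[str, Optional[Any]]]:
    """Yield (query_id, document) pairs; the document is None when lookup fails."""
    for query in queries:
        try:
            yield query.query_id, lookup(query.doc_id)
        except KeyError:  # what a plain dict raises for an unknown id
            yield query.query_id, None


# Invented stand-in data, for illustration only.
docs = {"d1": "text of article d1"}
queries = [
    TrecBackgroundLinkingQuery("q1", "d1", "https://example.com/d1"),
    TrecBackgroundLinkingQuery("q2", "d9", "https://example.com/d9"),
]

print(list(resolve_source_articles(queries, docs.__getitem__)))
```

With the real collection, dataset.docs_store().get should be usable as the lookup argument, assuming it signals a missing document with KeyError.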



"wapo/v2/trec-news-2019"

The TREC News 2019 Background Linking task. The task is to find relevant background information for the provided articles.

queries | docs | qrels | Citation | Metadata
60 queries

Language: en

Query type:
TrecBackgroundLinkingQuery: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, doc_id, url>



"wapo/v3/trec-news-2020"

The TREC News 2020 Background Linking task. The task is to find relevant background information for the provided articles.

If you have a copy of the v3 dataset, we would appreciate a pull request to add support!
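
Because the v3 documents are not supported, code written for the other subsets should check what a dataset offers before iterating. The sketch below assumes the dataset object exposes has_docs(), has_queries(), and has_qrels() methods, as described in the ir_datasets Python API documentation. available_parts is a hypothetical helper, and FakeV3 is a stand-in so that the example runs without the data.

```python
from typing import Any, Dict


def available_parts(dataset: Any) -> Dict[str, bool]:
    """Report which of docs, queries, and qrels the dataset claims to provide."""
    result = {}
    for part in ("docs", "queries", "qrels"):
        # A missing has_* method is treated as "not available".
        check = getattr(dataset, f"has_{part}", lambda: False)
        result[part] = bool(check())
    return result


class FakeV3:
    """Stand-in mimicking a dataset with queries and qrels but no docs."""

    def has_docs(self) -> bool:
        return False

    def has_queries(self) -> bool:
        return True

    def has_qrels(self) -> bool:
        return True


print(available_parts(FakeV3()))
```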

queries | qrels | Metadata
50 queries

Language: en

Query type:
TrecBackgroundLinkingQuery: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. url: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("wapo/v3/trec-news-2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, doc_id, url>



"wapo/v4"

(no description provided)

docs | Metadata
729K docs

Language: en

Document type:
WapoDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. author: str
  5. published_date: Optional[int]
  6. kicker: str
  7. body: str
  8. body_paras_html: Tuple[str, ...]
  9. body_media: Tuple[
    WapoDocMedia: (namedtuple)
    1. type: str
    2. url: str
    3. text: str
    , ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("wapo/v4")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>
