ir_datasets: Washington Post

To use this dataset, you need a copy of the Washington Post Collection, provided by NIST.
Your organization may already have a copy. If so, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" with NIST. Processing can take some time, but you will end up with a password-protected download link.
The source file required is WashingtonPost.v2.tar.gz.
ir_datasets expects the above file to be copied/linked under ~/.ir_datasets/wapo/WashingtonPost.v2.tar.gz.
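The archive can be copied or linked into place from a shell; the source path below is a placeholder for wherever your NIST download landed:

```shell
# Placeholder path -- substitute the actual location of your download:
SRC="$HOME/Downloads/WashingtonPost.v2.tar.gz"
mkdir -p ~/.ir_datasets/wapo
# Symlinking avoids duplicating the archive; a plain cp works too.
ln -sf "$SRC" ~/.ir_datasets/wapo/WashingtonPost.v2.tar.gz
```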
The Washington Post collection.
Version 2 of the Washington Post collection, consisting of articles published from 2012 to 2017.
The collection can be obtained by requesting it from NIST here.
body contains all body text in plain text format, including paragraphs and multi-media captions. body_paras_html contains only the source paragraphs, with HTML markup. body_media contains images, videos, tweets, and galleries, along with a link to the content and a textual caption.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>
You can find more details about the Python API here.
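The doc objects are plain namedtuples, so fields are accessed by attribute name. A standalone sketch using a hand-made stand-in (so it runs without the collection; all values below are invented):

```python
from collections import namedtuple

# Stand-in mirroring the wapo/v2 doc fields listed above; the values are invented.
WapoDoc = namedtuple('WapoDoc', ['doc_id', 'url', 'title', 'author', 'published_date',
                                 'kicker', 'body', 'body_paras_html', 'body_media'])
doc = WapoDoc('doc1', 'https://example.com/article', 'Example title', 'Example Author',
              0, 'Opinions', 'Plain-text body.', ['<p>Plain-text body.</p>'], [])

print(doc.title)    # fields are accessed by name
print(doc._fields)  # namedtuples also expose the full field list
```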
ir_datasets export wapo/v2 docs
[doc_id] [url] [title] [author] [published_date] [kicker] [body] [body_paras_html] [body_media]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2')
# Index wapo/v2
indexer = pt.IterDictIndexer('./indices/wapo_v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'author', 'kicker', 'body'])
You can find more details about PyTerrier indexing here.
The TREC Common Core 2018 benchmark.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-core-2018")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-core-2018 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-core-2018')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from wapo/v2
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-core-2018")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-core-2018 docs
[doc_id] [url] [title] [author] [published_date] [kicker] [body] [body_paras_html] [body_media]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-core-2018')
# Index wapo/v2
indexer = pt.IterDictIndexer('./indices/wapo_v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'author', 'kicker', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | not relevant |
1 | relevant |
2 | highly relevant |
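Measures that expect binary judgments (e.g. MAP) conventionally count a document as relevant when its graded relevance is 1 or higher; a minimal sketch of that binarization (the doc IDs and grades are invented):

```python
# Invented graded judgments for one query, keyed by doc_id.
graded = {'doc_a': 2, 'doc_b': 1, 'doc_c': 0}

# Binary convention: relevant iff relevance >= 1.
binary = {doc_id: int(rel >= 1) for doc_id, rel in graded.items()}
print(binary)  # {'doc_a': 1, 'doc_b': 1, 'doc_c': 0}
```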
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-core-2018")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
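A common pattern is to collect the qrels into a {query_id: {doc_id: relevance}} mapping. The sketch below substitutes a hand-made namedtuple list for dataset.qrels_iter() so it runs without the collection:

```python
from collections import defaultdict, namedtuple

# Stand-in for dataset.qrels_iter(); the values are invented.
Qrel = namedtuple('Qrel', ['query_id', 'doc_id', 'relevance', 'iteration'])
qrels_iter = [Qrel('321', 'doc_a', 2, '0'), Qrel('321', 'doc_b', 0, '0')]

# Nested lookup: qrels[query_id][doc_id] -> relevance.
qrels = defaultdict(dict)
for qrel in qrels_iter:
    qrels[qrel.query_id][qrel.doc_id] = qrel.relevance

print(dict(qrels))  # {'321': {'doc_a': 2, 'doc_b': 0}}
```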
ir_datasets export wapo/v2/trec-core-2018 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-core-2018')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
The TREC News 2018 Background Linking task. The task is to find relevant background information for the provided articles.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2018")
for query in dataset.queries_iter():
    query # namedtuple<query_id, doc_id, url>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-news-2018 queries
[query_id] [doc_id] [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2018')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('doc_id'))
You can find more details about PyTerrier retrieval here.
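Background-linking queries identify an article rather than carrying query text, so a query string must be derived from the article itself. One simple baseline (an illustration only, not the track's prescribed method) is to use the article body's most frequent non-stopword terms:

```python
from collections import Counter
import re

# Toy stopword list, for illustration only.
STOP = frozenset({'the', 'a', 'an', 'of', 'and', 'to', 'in', 'on', 'as'})

def top_terms(text, k=5):
    """Return the k most frequent non-stopword terms of an article body."""
    tokens = re.findall(r'[a-z]+', text.lower())
    counts = Counter(t for t in tokens if t not in STOP)
    return [term for term, _ in counts.most_common(k)]

text = "The sanctions bill passed. Sanctions on trade stalled as trade talks continued."
print(top_terms(text, 2))  # ['sanctions', 'trade']
```

A query built this way can be passed to the BM25 pipeline above in place of the raw topic doc_id.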
Language: en
Note: Uses docs from wapo/v2
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2018")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-news-2018 docs
[doc_id] [url] [title] [author] [published_date] [kicker] [body] [body_paras_html] [body_media]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2018')
# Index wapo/v2
indexer = pt.IterDictIndexer('./indices/wapo_v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'author', 'kicker', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | The document provides little or no useful background information. |
2 | The document provides some useful background or contextual information that would help the user understand the broader story context of the target article. |
4 | The document provides significantly useful background ... |
8 | The document provides essential useful background ... |
16 | The document _must_ appear in the sidebar otherwise critical context is missing. |
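The 0/2/4/8/16 scale is exponential, so feeding the raw values into nDCG weights the top level very heavily. One hypothetical remapping to evenly spaced gains (an assumption here, not the official track gain function) is the base-2 logarithm:

```python
import math

# Hypothetical gain remapping for the exponential 0/2/4/8/16 scale.
def gain(rel):
    return int(math.log2(rel)) if rel > 0 else 0

print([gain(r) for r in (0, 2, 4, 8, 16)])  # [0, 1, 2, 3, 4]
```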
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2018")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-news-2018 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2018')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('doc_id'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
The TREC News 2019 Background Linking task. The task is to find relevant background information for the provided articles.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, doc_id, url>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-news-2019 queries
[query_id] [doc_id] [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2019')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('doc_id'))
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from wapo/v2
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2019")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-news-2019 docs
[doc_id] [url] [title] [author] [published_date] [kicker] [body] [body_paras_html] [body_media]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2019')
# Index wapo/v2
indexer = pt.IterDictIndexer('./indices/wapo_v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'author', 'kicker', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition |
---|---|
0 | The document provides little or no useful background information. |
2 | The document provides some useful background or contextual information that would help the user understand the broader story context of the target article. |
4 | The document provides significantly useful background ... |
8 | The document provides essential useful background ... |
16 | The document _must_ appear in the sidebar otherwise critical context is missing. |
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2019")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-news-2019 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2019')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('doc_id'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
The TREC News 2020 Background Linking task. The task is to find relevant background information for the provided articles.
If you have a copy of the v3 dataset, we would appreciate a pull request to add support!
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v3/trec-news-2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, doc_id, url>
You can find more details about the Python API here.
ir_datasets export wapo/v3/trec-news-2020 queries
[query_id] [doc_id] [url]
...
You can find more details about the CLI here.
No example available for PyTerrier
Relevance levels
Rel. | Definition |
---|---|
0 | The document provides little or no useful background information. |
2 | The document provides some useful background or contextual information that would help the user understand the broader story context of the target article. |
4 | The document provides significantly useful background ... |
8 | The document provides essential useful background ... |
16 | The document _must_ appear in the sidebar otherwise critical context is missing. |
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v3/trec-news-2020")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wapo/v3/trec-news-2020 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier