ir_datasets: Washington Post
To use this dataset, you need a copy of the Washington Post Collection, provided by NIST.
Your organization may already have a copy. If this is the case, you may only need to complete a new "Individual Agreement". Otherwise, your organization will need to file the "Organizational Agreement" with NIST. It can take some time to process, but you will end up with a password-protected download link.
The source file required is WashingtonPost.v2.tar.gz.
ir_datasets expects the above file to be copied/linked under ~/.ir_datasets/wapo/WashingtonPost.v2.tar.gz.
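For example, a minimal sketch of linking the archive into place from Python (the download location shown is hypothetical):
import os
from pathlib import Path
# Hypothetical location where the NIST-provided archive was downloaded.
src = Path.home() / "Downloads" / "WashingtonPost.v2.tar.gz"
# Location where ir_datasets expects to find the archive.
dest = Path.home() / ".ir_datasets" / "wapo" / "WashingtonPost.v2.tar.gz"
dest.parent.mkdir(parents=True, exist_ok=True)
if not dest.exists():
    os.symlink(src, dest)  # or copy the file instead of linking it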
The Washington Post collection.
Version 2 of the Washington Post collection, consisting of articles published between 2012 and 2017.
The collection can be obtained by requesting it from NIST here.
body contains all body text in plain text format, including paragraphs and multi-media captions. body_paras_html contains only the source paragraphs and includes HTML markup. body_media contains images, videos, tweets, and galleries, along with a link to the content and a textual caption.
Language: en
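As a complement to the basic example below, here is a minimal sketch that peeks at one document's alternative body representations (it assumes, per the description above, that body is a string while body_paras_html and body_media are sequences):
import ir_datasets
dataset = ir_datasets.load("wapo/v2")
doc = next(iter(dataset.docs_iter()))  # first document in the collection
print(doc.body[:200])                  # plain-text body (paragraphs and captions)
print(doc.body_paras_html[:2])         # first two source paragraphs, with HTML markup
for media in doc.body_media:           # images, videos, tweets, galleries (may be empty)
    print(media)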
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>
You can find more details about the Python API here.
ir_datasets export wapo/v2 docs
[doc_id] [url] [title] [author] [published_date] [kicker] [body] [body_paras_html] [body_media]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2')
# Index wapo/v2
indexer = pt.IterDictIndexer('./indices/wapo_v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'author', 'kicker', 'body'])
You can find more details about PyTerrier indexing here.
{ "docs": { "count": 595037, "fields": { "doc_id": { "max_len": 36, "common_prefix": "" } } } }
The TREC Common Core 2018 benchmark.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-core-2018")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-core-2018 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-core-2018')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
Inherits docs from wapo/v2
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-core-2018")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-core-2018 docs
[doc_id] [url] [title] [author] [published_date] [kicker] [body] [body_paras_html] [body_media]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-core-2018')
# Index wapo/v2
indexer = pt.IterDictIndexer('./indices/wapo_v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'author', 'kicker', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 22K | 85.0% |
1 | relevant | 2.1K | 7.9% |
2 | highly relevant | 1.9K | 7.1% |
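Since the judgments are graded, a binary view of relevance is sometimes needed (for example for MAP-style or set-based measures); a minimal sketch that treats any relevance of 1 or higher as relevant:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-core-2018")
binary_qrels = {}
for qrel in dataset.qrels_iter():
    # group judgments by query and collapse the graded labels (0/1/2) to 0/1
    binary_qrels.setdefault(qrel.query_id, {})[qrel.doc_id] = int(qrel.relevance >= 1)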
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-core-2018")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-core-2018 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-core-2018')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
{ "docs": { "count": 595037, "fields": { "doc_id": { "max_len": 36, "common_prefix": "" } } }, "queries": { "count": 50 }, "qrels": { "count": 26233, "fields": { "relevance": { "counts_by_value": { "0": 22285, "2": 1865, "1": 2083 } } } } }
The TREC News 2018 Background Linking task. The task is to find relevant background information for the provided articles.
Language: en
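Each background-linking topic refers to an article in the wapo/v2 collection by doc_id; a minimal sketch that looks up the source article for each query using ir_datasets' random-access docs_store:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2018")
docs_store = dataset.docs_store()           # random access to wapo/v2 documents by doc_id
for query in dataset.queries_iter():
    article = docs_store.get(query.doc_id)  # the article for which background is sought
    print(query.query_id, article.title)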
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2018")
for query in dataset.queries_iter():
    query # namedtuple<query_id, doc_id, url>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-news-2018 queries
[query_id] [doc_id] [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2018')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('doc_id'))
You can find more details about PyTerrier retrieval here.
Inherits docs from wapo/v2
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2018")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-news-2018 docs
[doc_id] [url] [title] [author] [published_date] [kicker] [body] [body_paras_html] [body_media]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2018')
# Index wapo/v2
indexer = pt.IterDictIndexer('./indices/wapo_v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'author', 'kicker', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | The document provides little or no useful background information. | 6.5K | 76.0% |
2 | The document provides some useful background or contextual information that would help the user understand the broader story context of the target article. | 1.2K | 14.0% |
4 | The document provides significantly useful background ... | 584 | 6.9% |
8 | The document provides essential useful background ... | 164 | 1.9% |
16 | The document _must_ appear in the sidebar otherwise critical context is missing. | 106 | 1.2% |
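Besides the PyTerrier experiment shown further below, a run in TREC format can be scored directly against these graded judgments with the companion ir_measures package; a minimal sketch (assuming ir_measures is installed; the run file name is hypothetical, and nDCG@5 is just an example cutoff):
import ir_datasets
import ir_measures
from ir_measures import nDCG
dataset = ir_datasets.load("wapo/v2/trec-news-2018")
run = ir_measures.read_trec_run("my_run.trec")  # hypothetical run file in TREC format
print(ir_measures.calc_aggregate([nDCG@5], dataset.qrels_iter(), run))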
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2018")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-news-2018 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2018')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('doc_id'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Soboroff2018News,
  title={TREC 2018 News Track Overview},
  author={Ian Soboroff and Shudong Huang and Donna Harman},
  booktitle={TREC},
  year={2018}
}
Metadata:
{ "docs": { "count": 595037, "fields": { "doc_id": { "max_len": 36, "common_prefix": "" } } }, "queries": { "count": 50 }, "qrels": { "count": 8508, "fields": { "relevance": { "counts_by_value": { "16": 106, "2": 1189, "0": 6465, "4": 584, "8": 164 } } } } }
The TREC News 2019 Background Linking task. The task is to find relevant background information for the provided articles.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, doc_id, url>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-news-2019 queries
[query_id] [doc_id] [url]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2019')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('doc_id'))
You can find more details about PyTerrier retrieval here.
Inherits docs from wapo/v2
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2019")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, author, published_date, kicker, body, body_paras_html, body_media>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-news-2019 docs
[doc_id] [url] [title] [author] [published_date] [kicker] [body] [body_paras_html] [body_media]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2019')
# Index wapo/v2
indexer = pt.IterDictIndexer('./indices/wapo_v2')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'author', 'kicker', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | The document provides little or no useful background information. | 13K | 80.6% |
2 | The document provides some useful background or contextual information that would help the user understand the broader story context of the target article. | 1.7K | 10.7% |
4 | The document provides significantly useful background ... | 660 | 4.2% |
8 | The document provides essential useful background ... | 431 | 2.8% |
16 | The document _must_ appear in the sidebar otherwise critical context is missing. | 273 | 1.7% |
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v2/trec-news-2019")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wapo/v2/trec-news-2019 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wapo/v2/trec-news-2019')
index_ref = pt.IndexRef.of('./indices/wapo_v2') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('doc_id'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Soboroff2019News,
  title={TREC 2019 News Track Overview},
  author={Ian Soboroff and Shudong Huang and Donna Harman},
  booktitle={TREC},
  year={2019}
}
Metadata:
{ "docs": { "count": 595037, "fields": { "doc_id": { "max_len": 36, "common_prefix": "" } } }, "queries": { "count": 60 }, "qrels": { "count": 15655, "fields": { "relevance": { "counts_by_value": { "2": 1677, "0": 12614, "8": 431, "16": 273, "4": 660 } } } } }
The TREC News 2020 Background Linking task. The task is to find relevant background information for the provided articles.
Documents for this task come from version 3 of the collection, which is not yet supported. If you have a copy of the v3 dataset, we would appreciate a pull request to add support!
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v3/trec-news-2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, doc_id, url>
You can find more details about the Python API here.
ir_datasets export wapo/v3/trec-news-2020 queries
[query_id] [doc_id] [url]
...
You can find more details about the CLI here.
No example available for PyTerrier
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | The document provides little or no useful background information. | 15K | 86.4% |
2 | The document provides some useful background or contextual information that would help the user understand the broader story context of the target article. | 1.6K | 9.0% |
4 | The document provides significantly useful background ... | 631 | 3.6% |
8 | The document provides essential useful background ... | 132 | 0.7% |
16 | The document _must_ appear in the sidebar otherwise critical context is missing. | 50 | 0.3% |
Examples:
import ir_datasets
dataset = ir_datasets.load("wapo/v3/trec-news-2020")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wapo/v3/trec-news-2020 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
{ "queries": { "count": 50 }, "qrels": { "count": 17764, "fields": { "relevance": { "counts_by_value": { "0": 15348, "2": 1603, "4": 631, "8": 132, "16": 50 } } } } }