Github: datasets/nyt.py

ir_datasets: NYT

Index
  1. nyt
  2. nyt/train
  3. nyt/valid

"nyt"

The New York Times Annotated Corpus. Consists of articles published between 1987 and 2007. It is useful for transferring relevance signals in cases where training data is in short supply.

Uses data from LDC2008T19.

Available: docs

Language: en

Document type:
NytDoc: (namedtuple)
  1. doc_id: str
  2. metadata_html: str
  3. headline: str
  4. lead_paragraph_html: str
  5. fulltext_html: str

Example

import ir_datasets
dataset = ir_datasets.load('nyt')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, metadata_html, headline, lead_paragraph_html, fulltext_html>
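
Because three of the NytDoc fields hold HTML, a common first step is reducing them to plain text. The following is a minimal sketch using only the Python standard library, not part of ir_datasets itself. The `html_to_text` helper and the sample document are hypothetical, and the local `NytDoc` namedtuple is a stand-in that mirrors the field list above so the snippet runs without the LDC data.

```python
from collections import namedtuple
from html.parser import HTMLParser

# Stand-in mirroring the NytDoc fields listed above; real documents come
# from ir_datasets.load('nyt').docs_iter().
NytDoc = namedtuple(
    "NytDoc",
    ["doc_id", "metadata_html", "headline", "lead_paragraph_html", "fulltext_html"],
)


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards tags."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities such as &amp; are decoded.
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Hypothetical document, made up for illustration only.
sample = NytDoc(
    doc_id="0000001",
    metadata_html="<meta/>",
    headline="Example Headline",
    lead_paragraph_html="<p>First paragraph.</p>",
    fulltext_html="<p>First paragraph.</p>\n<p>Second &amp; last paragraph.</p>",
)
plain_text = html_to_text(sample.fulltext_html)
```

Tags are dropped without inserting whitespace, so adjacent elements with no whitespace between them in the source would run together; a real pipeline may want to treat block-level tags as separators.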

"nyt/train"

Training set (excluding the held-out nyt/valid split) for transferring relevance signals from the NYT corpus.

  • Queries: Headlines
  • Relevance: Each headline is assumed to be relevant to its article
  • Paper
Available: queries, docs, qrels

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('nyt/train')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
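
Since each headline is treated as a query that is relevant to its own article, the queries and relevance judgments can be joined into (headline, document) training pairs. The sketch below shows one possible way to do that. The `headline_pairs` helper is hypothetical, the `Qrel` field names (`query_id`, `doc_id`, `relevance`) are an assumption about the judgment records, and the sample records are made up so the snippet runs without the corpus.

```python
from collections import namedtuple

# Stand-ins for illustration. GenericQuery matches the query type listed above;
# the Qrel field names are an assumption.
GenericQuery = namedtuple("GenericQuery", ["query_id", "text"])
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance"])


def headline_pairs(queries, qrels, min_relevance=1):
    """Yield (headline_text, doc_id) for each judgment at or above min_relevance."""
    text_by_id = {q.query_id: q.text for q in queries}
    for qrel in qrels:
        # Skip judgments whose query is missing or whose label is too low.
        if qrel.relevance >= min_relevance and qrel.query_id in text_by_id:
            yield text_by_id[qrel.query_id], qrel.doc_id


# Hypothetical records; with the real data one would pass
# dataset.queries_iter() and dataset.qrels_iter() instead.
queries = [
    GenericQuery("q1", "Example Headline"),
    GenericQuery("q2", "Another Headline"),
]
qrels = [Qrel("q1", "d1", 1), Qrel("q2", "d2", 1), Qrel("q3", "d3", 1)]
pairs = list(headline_pairs(queries, qrels))
```

The helper holds every query text in a dictionary, which keeps the join simple at the cost of memory; for a large split, a streaming join over sorted IDs may be preferable.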

"nyt/valid"

Held-out validation set for transferring relevance signals from the NYT corpus (see nyt/train).

Available: queries, docs, qrels

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('nyt/valid')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
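
Because nyt/valid is described as held out from nyt/train, a quick sanity check before training is to confirm that no query ID appears in both splits. The following is a small sketch of such a check. The `shared_query_ids` helper and the sample queries are hypothetical; with the real data, the two arguments would be the `queries_iter()` results of the two datasets.

```python
from collections import namedtuple

# Stand-in matching the GenericQuery type listed above.
GenericQuery = namedtuple("GenericQuery", ["query_id", "text"])


def shared_query_ids(train_queries, valid_queries):
    """Return the set of query_ids that occur in both splits."""
    train_ids = {q.query_id for q in train_queries}
    return {q.query_id for q in valid_queries if q.query_id in train_ids}


# Hypothetical queries, made up for illustration only.
train = [GenericQuery("q1", "Example Headline"), GenericQuery("q2", "Another Headline")]
valid = [GenericQuery("q3", "Held-Out Headline")]
overlap = shared_query_ids(train, valid)
```

An empty result is what a properly held-out split should produce; a non-empty set would point to leakage between the two.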