ir_datasets
: NYTThe New York Times Annotated Corpus. Consists of articles published between 1987 and 2007. It is useful for transferring relevance signals in cases where training data is in short supply.
Uses data from LDC2008T19.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('nyt')
for doc in dataset.docs_iter():
doc # namedtuple<doc_id, metadata_html, headline, lead_paragraph_html, fulltext_html>
Held-out validation set for transferring relevance signals from NYT corpus (see nyt/train).