Github: datasets/nyt.py

ir_datasets: NYT

Index
  1. nyt
  2. nyt/train
  3. nyt/valid

"nyt"

The New York Times Annotated Corpus. Consists of articles published between 1987 and 2007. It is useful for transferring relevance signals in cases where training data is in short supply.

Uses data from LDC2008T19.

docs

Language: en

Document type:
NytDoc: (namedtuple)
  1. doc_id: str
  2. metadata_html: str
  3. headline: str
  4. lead_paragraph_html: str
  5. fulltext_html: str

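The three `*_html` fields appear, from their names, to carry markup rather than plain text, so most indexing or tokenization pipelines will want to strip tags first. The following is a minimal sketch using only the Python standard library; `html_to_text` and the sample fragment are hypothetical and not part of ir_datasets, and the real markup in the corpus may need more careful handling (for example, inline tags that split a word).

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(markup: str) -> str:
    """Return the text content of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join with spaces so adjacent block elements do not run together,
    # then collapse any repeated whitespace.
    return " ".join(" ".join(parser.parts).split())


# Hypothetical fragment standing in for doc.lead_paragraph_html
sample = "<p>Stocks rose  sharply</p><p>on Monday &amp; Tuesday.</p>"
print(html_to_text(sample))
```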
Example

import ir_datasets
dataset = ir_datasets.load('nyt')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, metadata_html, headline, lead_paragraph_html, fulltext_html>

Citation
bibtex:
@article{sandhaus2008nyt,
  title={The new york times annotated corpus},
  author={Sandhaus, Evan},
  journal={Linguistic Data Consortium, Philadelphia},
  volume={6},
  number={12},
  pages={e26752},
  year={2008}
}

"nyt/train"

Training set (excluding the held-out nyt/valid split) for transferring relevance signals from the NYT corpus.

  • Queries: Headlines
  • Relevance: Each headline is assumed to be relevant to its own article
  • Paper
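
The headline-as-query idea in the list above can be sketched as follows. This is an illustration under stated assumptions, not the code ir_datasets uses to build the split: the namedtuples are local stand-ins that mirror the field names on this page, and reusing `doc_id` as `query_id` and skipping empty headlines are both assumptions.

```python
from collections import namedtuple

# Local stand-ins mirroring the field names listed on this page
# (Doc keeps only the two fields this sketch needs).
Doc = namedtuple("Doc", ["doc_id", "headline"])
GenericQuery = namedtuple("GenericQuery", ["query_id", "text"])
GenericQrel = namedtuple("GenericQrel", ["query_id", "doc_id", "relevance"])


def headline_supervision(docs):
    """Yield (query, qrel) pairs treating each headline as a query for its own article."""
    for doc in docs:
        if not doc.headline.strip():
            continue  # assumption: an article without a headline gives no usable query
        # Assumption: the query reuses the article's doc_id as its query_id.
        query = GenericQuery(query_id=doc.doc_id, text=doc.headline)
        qrel = GenericQrel(query_id=doc.doc_id, doc_id=doc.doc_id, relevance=1)
        yield query, qrel


# Hypothetical documents, not taken from the corpus.
docs = [
    Doc("d1", "City Council Approves Budget"),
    Doc("d2", ""),
    Doc("d3", "Storm Closes Schools"),
]
pairs = list(headline_supervision(docs))
```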

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('nyt/train')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

docs

Language: en

Document type:
NytDoc: (namedtuple)
  1. doc_id: str
  2. metadata_html: str
  3. headline: str
  4. lead_paragraph_html: str
  5. fulltext_html: str

Example

import ir_datasets
dataset = ir_datasets.load('nyt/train')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, metadata_html, headline, lead_paragraph_html, fulltext_html>

qrels

Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.  Definition
1     title is associated with article body

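Since the only relevance level is 1, each judgment can be read as a positive (query, document) pair. A minimal sketch of joining judgments to query text is shown here; `positive_pairs` is a hypothetical helper, the namedtuples are local stand-ins with the field names from this page, and the sample IDs are made up. With the real dataset, the outputs of `dataset.queries_iter()` and `dataset.qrels_iter()` would be passed in instead.

```python
from collections import namedtuple

# Local stand-ins mirroring the field names listed on this page.
GenericQuery = namedtuple("GenericQuery", ["query_id", "text"])
GenericQrel = namedtuple("GenericQrel", ["query_id", "doc_id", "relevance"])


def positive_pairs(queries, qrels, min_relevance=1):
    """Return (query text, doc_id) pairs for judgments at or above min_relevance."""
    text_by_id = {q.query_id: q.text for q in queries}
    return [
        (text_by_id[r.query_id], r.doc_id)
        for r in qrels
        # Judgments whose query_id has no matching query are skipped.
        if r.relevance >= min_relevance and r.query_id in text_by_id
    ]


# Hypothetical records, not taken from the corpus.
queries = [
    GenericQuery("q1", "Storm Closes Schools"),
    GenericQuery("q2", "City Council Approves Budget"),
]
qrels = [
    GenericQrel("q1", "d7", 1),
    GenericQrel("q2", "d9", 1),
    GenericQrel("q3", "d4", 1),  # no matching query, so it is dropped
]
pairs = positive_pairs(queries, qrels)
```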
Example

import ir_datasets
dataset = ir_datasets.load('nyt/train')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

Citation
bibtex:
@inproceedings{macavaney:sigir2019-nyt,
  author = {MacAvaney, Sean and Yates, Andrew and Hui, Kai and Frieder, Ophir},
  title = {Content-Based Weak Supervision for Ad-Hoc Re-Ranking},
  booktitle = {SIGIR},
  year = {2019}
}

"nyt/valid"

Held-out validation set for transferring relevance signals from the NYT corpus (see nyt/train).

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('nyt/valid')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

docs

Language: en

Document type:
NytDoc: (namedtuple)
  1. doc_id: str
  2. metadata_html: str
  3. headline: str
  4. lead_paragraph_html: str
  5. fulltext_html: str

Example

import ir_datasets
dataset = ir_datasets.load('nyt/valid')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, metadata_html, headline, lead_paragraph_html, fulltext_html>

qrels

Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.  Definition
1     title is associated with article body

Example

import ir_datasets
dataset = ir_datasets.load('nyt/valid')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>