GitHub: datasets/nyt.py

ir_datasets: NYT

Index
  1. nyt
  2. nyt/trec-core-2017
  3. nyt/wksup
  4. nyt/wksup/train
  5. nyt/wksup/valid

Data Access Information

To use this dataset, you need a copy of the source corpus, provided by the Linguistic Data Consortium (LDC). The specific resource needed is LDC2008T19.

Many organizations already have a subscription to the LDC, so access to the collection can be as easy as confirming the data usage agreement and downloading the corpus. Check with your library for access details.

The source file is: nyt_corpus_LDC2008T19.tgz.

ir_datasets expects this file to be copied/linked as ~/.ir_datasets/nyt/nyt.tgz.
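
The linking step above can be sketched with the standard library alone. This is a minimal sketch, not part of ir_datasets: the helper name `link_nyt_corpus` and its `ir_datasets_home` parameter are hypothetical, and it assumes a platform where symbolic links are permitted.

```python
import os
from pathlib import Path


def link_nyt_corpus(source_tgz, ir_datasets_home=None):
    """Symlink the LDC archive to the path ir_datasets expects.

    `ir_datasets_home` is a hypothetical override used here for testing;
    by default the target is ~/.ir_datasets/nyt/nyt.tgz.
    """
    home = Path(ir_datasets_home) if ir_datasets_home else Path.home() / ".ir_datasets"
    target = home / "nyt" / "nyt.tgz"
    # Create ~/.ir_datasets/nyt/ if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Leave an existing file or link untouched rather than overwriting it.
    if not os.path.lexists(target):
        target.symlink_to(Path(source_tgz).resolve())
    return target
```

Calling `link_nyt_corpus("nyt_corpus_LDC2008T19.tgz")` from the directory that holds the download would create the link; copying the file to the same location works equally well.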


"nyt"

The New York Times Annotated Corpus, consisting of articles published between 1987 and 2007. It is used in TREC Core 2017 and is also useful for transferring relevance signals when training data is in short supply.

Uses data from LDC2008T19. The source collection can be downloaded from the LDC.

docs | Citation | Metadata
1.9M docs

Language: en

Document type:
NytDoc: (namedtuple)
  1. doc_id: str
  2. headline: str
  3. body: str
  4. source_xml: str
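
To make the field list concrete, here is an illustrative stand-in with the same four fields. The class name `NytDocSketch` and all values are placeholders invented for this sketch; it is not the class ir_datasets defines and the values are not taken from the corpus.

```python
from typing import NamedTuple


class NytDocSketch(NamedTuple):
    """Hypothetical stand-in mirroring the four NytDoc fields listed above."""
    doc_id: str
    headline: str
    body: str
    source_xml: str


# Placeholder values for illustration only.
example = NytDocSketch(
    doc_id="0000000",
    headline="An example headline",
    body="An example body.",
    source_xml="<xml/>",
)

# A namedtuple can be read by field name, by position, or unpacked.
doc_id, headline, body, source_xml = example
```

Access by name (`example.headline`) is usually the more readable choice, since it does not depend on field order.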

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("nyt")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, headline, body, source_xml>

You can find more details about the Python API here.
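
With 1.9M documents, it can help to look at only the first few before running a full pass. The sketch below assumes nothing beyond `docs_iter()` and the `doc_id`, `headline`, and `body` fields shown above; the helper names `preview_docs` and `summarize` are hypothetical.

```python
from itertools import islice


def preview_docs(dataset, n=3):
    """Return the first n documents without iterating the whole corpus."""
    return list(islice(dataset.docs_iter(), n))


def summarize(doc, width=60):
    """Build a one-line summary: doc_id, headline, and the start of the body."""
    # Collapse runs of whitespace so the snippet stays on one line.
    snippet = " ".join(doc.body.split())[:width]
    return f"{doc.doc_id}\t{doc.headline}\t{snippet}"
```

For example, `for doc in preview_docs(ir_datasets.load("nyt")): print(summarize(doc))` would print three summary lines once the corpus is in place.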


"nyt/trec-core-2017"

The TREC Common Core 2017 benchmark.

Note that this dataset only contains the 50 queries assessed by NIST.

queries | docs | qrels | Citation | Metadata
50 queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("nyt/trec-core-2017")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>
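
The relevance judgments for these queries can be grouped per query in a similar way. This is a sketch under assumptions: it expects each judgment to expose `query_id`, `doc_id`, and `relevance` attributes (field names not listed on this page), and the helper names are hypothetical.

```python
from collections import defaultdict


def qrels_by_query(qrels):
    """Group judgments as {query_id: {doc_id: relevance}}."""
    grouped = defaultdict(dict)
    for qrel in qrels:
        # Assumed attribute names: query_id, doc_id, relevance.
        grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


def relevant_counts(grouped, min_relevance=1):
    """Count documents per query judged at or above min_relevance."""
    return {
        query_id: sum(1 for rel in judged.values() if rel >= min_relevance)
        for query_id, judged in grouped.items()
    }
```

Passing `dataset.qrels_iter()` to `qrels_by_query` would build the mapping for the benchmark, after which `relevant_counts` gives a quick view of how many relevant documents each query has.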



"nyt/wksup"

Training set (without held-out nyt/wksup/valid) for transferring relevance signals from the NYT corpus.

queries | docs | qrels | Citation | Metadata
1.9M queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("nyt/wksup")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"nyt/wksup/train"

Training set (without held-out nyt/wksup/valid) for transferring relevance signals from the NYT corpus.

queries | docs | qrels | Citation | Metadata
1.9M queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("nyt/wksup/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>



"nyt/wksup/valid"

Held-out validation set for transferring relevance signals from the NYT corpus (see nyt/wksup/train).

queries | docs | qrels | Citation | Metadata
1.0K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("nyt/wksup/valid")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
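
Because this set is described as held out from nyt/wksup/train, one sanity check is that the two splits share no query IDs. The sketch below is an illustration with hypothetical helper names; it relies only on the `query_id` field of the queries and keeps both ID sets in memory.

```python
def query_ids(queries):
    """Collect the query_id of every query into a set."""
    return {query.query_id for query in queries}


def split_overlap(train_queries, valid_queries):
    """Return query IDs present in both splits.

    If the splits are disjoint, as the descriptions suggest, this is empty.
    """
    return query_ids(train_queries) & query_ids(valid_queries)
```

It could be called as `split_overlap(train.queries_iter(), valid.queries_iter())`, where `train` and `valid` are the datasets loaded from "nyt/wksup/train" and "nyt/wksup/valid".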
