GitHub: datasets/nyt.py

ir_datasets: NYT

Index
  1. nyt
  2. nyt/trec-core-2017
  3. nyt/wksup
  4. nyt/wksup/train
  5. nyt/wksup/valid

Data Access Information

To use this dataset, you need a copy of the source corpus, provided by the Linguistic Data Consortium (LDC). The specific resource needed is LDC2008T19.

Many organizations already have a subscription to the LDC, so access to the collection can be as easy as confirming the data usage agreement and downloading the corpus. Check with your library for access details.

The source file is: nyt_corpus_LDC2008T19.tgz.

ir_datasets expects this file to be copied/linked as ~/.ir_datasets/nyt/nyt.tgz.
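
The copy/link step could be scripted along these lines. This is a minimal sketch, not part of ir_datasets: the function name link_nyt_corpus and its home parameter are hypothetical, and it assumes the default ~/.ir_datasets location and a platform that allows symbolic links.

```python
from pathlib import Path


def link_nyt_corpus(source_tgz, home=None):
    """Symlink the LDC archive to <home>/.ir_datasets/nyt/nyt.tgz and return the link path.

    `home` defaults to the current user's home directory (an assumption of this
    sketch; it is only a parameter so the function can be tried in a scratch dir).
    """
    home = Path(home) if home is not None else Path.home()
    target = home / ".ir_datasets" / "nyt" / "nyt.tgz"
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.is_symlink() or target.exists():
        # Leave any existing file or link untouched.
        return target
    target.symlink_to(Path(source_tgz).resolve())
    return target
```

Calling link_nyt_corpus("nyt_corpus_LDC2008T19.tgz") from the directory that holds the download would create the link. Copying the file instead works just as well when symbolic links are not an option.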


"nyt"

The New York Times Annotated Corpus, which consists of articles published between 1987 and 2007. It is used in TREC Core 2017 and is also useful for transferring relevance signals when training data is in short supply.

Uses data from LDC2008T19. The source collection can be downloaded from the LDC.

docs

Language: en

Document type:
NytDoc: (namedtuple)
  1. doc_id: str
  2. headline: str
  3. body: str
  4. source_xml: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nyt")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, headline, body, source_xml>

You can find more details about the Python API here.
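
Because source_xml holds the raw XML of each article, printing whole records is rarely practical. The sketch below shows one way to produce a compact one-line preview from the other three fields. NytDocLike and preview are hypothetical names used for illustration only; the stand-in namedtuple mirrors the NytDoc fields so the snippet runs without the corpus.

```python
from collections import namedtuple

# Stand-in with the same field names as NytDoc (illustration only).
NytDocLike = namedtuple("NytDocLike", ["doc_id", "headline", "body", "source_xml"])


def preview(doc, max_chars=80):
    """Return 'doc_id<TAB>headline<TAB>body', with the body whitespace-collapsed and truncated."""
    body = " ".join(doc.body.split())
    if len(body) > max_chars:
        body = body[:max_chars].rstrip() + "..."
    return f"{doc.doc_id}\t{doc.headline}\t{body}"
```

With the corpus installed, the same function can be applied to the records yielded by dataset.docs_iter(), since it only relies on the doc_id, headline, and body attributes.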

CLI
ir_datasets export nyt docs
[doc_id]    [headline]    [body]    [source_xml]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nyt')
# Index nyt
indexer = pt.IterDictIndexer('./indices/nyt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['headline', 'body'])

You can find more details about PyTerrier indexing here.

Citation

ir_datasets.bib:

\cite{Sandhaus2008Nyt}

Bibtex:

@article{Sandhaus2008Nyt, title={The new york times annotated corpus}, author={Sandhaus, Evan}, journal={Linguistic Data Consortium, Philadelphia}, volume={6}, number={12}, pages={e26752}, year={2008} }

"nyt/trec-core-2017"

The TREC Common Core 2017 benchmark.

Note that this dataset only contains the 50 queries assessed by NIST.

queries

Language: en

Query type:
TrecQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nyt/trec-core-2017")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI
ir_datasets export nyt/trec-core-2017 queries
[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nyt/trec-core-2017')
index_ref = pt.IndexRef.of('./indices/nyt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from nyt

Document type:
NytDoc: (namedtuple)
  1. doc_id: str
  2. headline: str
  3. body: str
  4. source_xml: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nyt/trec-core-2017")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, headline, body, source_xml>

You can find more details about the Python API here.

CLI
ir_datasets export nyt/trec-core-2017 docs
[doc_id]    [headline]    [body]    [source_xml]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nyt/trec-core-2017')
# Index nyt
indexer = pt.IterDictIndexer('./indices/nyt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['headline', 'body'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     not relevant
1     relevant
2     highly relevant
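
The three relevance levels can be tallied or collapsed to binary labels along these lines. This is a sketch under stated assumptions: Qrel is a stand-in namedtuple with the same fields as TrecQrel, and count_by_level, binarize, and the min_relevance threshold are hypothetical helpers, not part of ir_datasets.

```python
from collections import Counter, namedtuple

# Stand-in with the same field names as TrecQrel (illustration only).
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])

# Labels taken from the relevance-level table for nyt/trec-core-2017.
RELEVANCE_LABELS = {0: "not relevant", 1: "relevant", 2: "highly relevant"}


def count_by_level(qrels):
    """Count judgments per relevance label."""
    counts = Counter(qrel.relevance for qrel in qrels)
    return {label: counts.get(level, 0) for level, label in RELEVANCE_LABELS.items()}


def binarize(qrels, min_relevance=1):
    """Map graded judgments to 0/1, treating levels >= min_relevance as relevant."""
    return [
        qrel._replace(relevance=int(qrel.relevance >= min_relevance))
        for qrel in qrels
    ]
```

Both helpers accept any iterable of records with a relevance attribute, so the output of dataset.qrels_iter() can be passed in directly. Setting min_relevance=2 keeps only the "highly relevant" judgments as positives.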

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nyt/trec-core-2017")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export nyt/trec-core-2017 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nyt/trec-core-2017')
index_ref = pt.IndexRef.of('./indices/nyt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Allan2017TrecCore,Sandhaus2008Nyt}

Bibtex:

@inproceedings{Allan2017TrecCore, author = {James Allan and Donna Harman and Evangelos Kanoulas and Dan Li and Christophe Van Gysel and Ellen Voorhees}, title = {TREC 2017 Common Core Track Overview}, booktitle = {TREC}, year = {2017} }
@article{Sandhaus2008Nyt, title={The new york times annotated corpus}, author={Sandhaus, Evan}, journal={Linguistic Data Consortium, Philadelphia}, volume={6}, number={12}, pages={e26752}, year={2008} }

"nyt/wksup"

Training set (without held-out nyt/wksup/valid) for transferring relevance signals from NYT corpus.

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nyt/wksup")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export nyt/wksup queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nyt/wksup')
index_ref = pt.IndexRef.of('./indices/nyt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from nyt

Document type:
NytDoc: (namedtuple)
  1. doc_id: str
  2. headline: str
  3. body: str
  4. source_xml: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nyt/wksup")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, headline, body, source_xml>

You can find more details about the Python API here.

CLI
ir_datasets export nyt/wksup docs
[doc_id]    [headline]    [body]    [source_xml]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nyt/wksup')
# Index nyt
indexer = pt.IterDictIndexer('./indices/nyt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['headline', 'body'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.  Definition
1     title is associated with article body
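
The single relevance level reflects the weak-supervision idea of treating an article's title as a pseudo-query for its own body. The sketch below illustrates that pairing on stand-in records. It is not the code used to build this dataset: Doc, Query, Qrel, and headline_pairs are hypothetical names, and reusing the doc_id as the query_id and skipping articles with an empty headline or body are assumptions of the sketch.

```python
from collections import namedtuple

# Stand-ins mirroring the NytDoc, GenericQuery and GenericQrel fields (illustration only).
Doc = namedtuple("Doc", ["doc_id", "headline", "body", "source_xml"])
Query = namedtuple("Query", ["query_id", "text"])
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance"])


def headline_pairs(docs):
    """Yield (query, qrel) pairs that treat each headline as a pseudo-query for its article."""
    for doc in docs:
        headline = doc.headline.strip()
        if not headline or not doc.body.strip():
            # Assumption: articles missing either field give no usable training pair.
            continue
        # Assumption: the query reuses the article's doc_id as its query_id.
        yield Query(doc.doc_id, headline), Qrel(doc.doc_id, doc.doc_id, 1)
```

Every emitted judgment carries relevance 1, matching the table above. No negative judgments are produced here, so how non-relevant documents are sampled is left to the training procedure.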

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nyt/wksup")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export nyt/wksup qrels --format tsv
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nyt/wksup')
index_ref = pt.IndexRef.of('./indices/nyt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{MacAvaney2019Wksup,Sandhaus2008Nyt}

Bibtex:

@inproceedings{MacAvaney2019Wksup, author = {MacAvaney, Sean and Yates, Andrew and Hui, Kai and Frieder, Ophir}, title = {Content-Based Weak Supervision for Ad-Hoc Re-Ranking}, booktitle = {SIGIR}, year = {2019} }
@article{Sandhaus2008Nyt, title={The new york times annotated corpus}, author={Sandhaus, Evan}, journal={Linguistic Data Consortium, Philadelphia}, volume={6}, number={12}, pages={e26752}, year={2008} }

"nyt/wksup/train"

Training set (without held-out nyt/wksup/valid) for transferring relevance signals from NYT corpus.

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nyt/wksup/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export nyt/wksup/train queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nyt/wksup/train')
index_ref = pt.IndexRef.of('./indices/nyt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from nyt

Document type:
NytDoc: (namedtuple)
  1. doc_id: str
  2. headline: str
  3. body: str
  4. source_xml: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nyt/wksup/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, headline, body, source_xml>

You can find more details about the Python API here.

CLI
ir_datasets export nyt/wksup/train docs
[doc_id]    [headline]    [body]    [source_xml]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nyt/wksup/train')
# Index nyt
indexer = pt.IterDictIndexer('./indices/nyt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['headline', 'body'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.  Definition
1     title is associated with article body

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nyt/wksup/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export nyt/wksup/train qrels --format tsv
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nyt/wksup/train')
index_ref = pt.IndexRef.of('./indices/nyt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{MacAvaney2019Wksup,Sandhaus2008Nyt}

Bibtex:

@inproceedings{MacAvaney2019Wksup, author = {MacAvaney, Sean and Yates, Andrew and Hui, Kai and Frieder, Ophir}, title = {Content-Based Weak Supervision for Ad-Hoc Re-Ranking}, booktitle = {SIGIR}, year = {2019} }
@article{Sandhaus2008Nyt, title={The new york times annotated corpus}, author={Sandhaus, Evan}, journal={Linguistic Data Consortium, Philadelphia}, volume={6}, number={12}, pages={e26752}, year={2008} }

"nyt/wksup/valid"

Held-out validation set for transferring relevance signals from NYT corpus (see nyt/wksup/train).

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nyt/wksup/valid")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export nyt/wksup/valid queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nyt/wksup/valid')
index_ref = pt.IndexRef.of('./indices/nyt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from nyt

Document type:
NytDoc: (namedtuple)
  1. doc_id: str
  2. headline: str
  3. body: str
  4. source_xml: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nyt/wksup/valid")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, headline, body, source_xml>

You can find more details about the Python API here.

CLI
ir_datasets export nyt/wksup/valid docs
[doc_id]    [headline]    [body]    [source_xml]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nyt/wksup/valid')
# Index nyt
indexer = pt.IterDictIndexer('./indices/nyt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['headline', 'body'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
GenericQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int

Relevance levels

Rel.  Definition
1     title is associated with article body

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nyt/wksup/valid")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance>

You can find more details about the Python API here.

CLI
ir_datasets export nyt/wksup/valid qrels --format tsv
[query_id]    [doc_id]    [relevance]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nyt/wksup/valid')
index_ref = pt.IndexRef.of('./indices/nyt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{MacAvaney2019Wksup,Sandhaus2008Nyt}

Bibtex:

@inproceedings{MacAvaney2019Wksup, author = {MacAvaney, Sean and Yates, Andrew and Hui, Kai and Frieder, Ophir}, title = {Content-Based Weak Supervision for Ad-Hoc Re-Ranking}, booktitle = {SIGIR}, year = {2019} }
@article{Sandhaus2008Nyt, title={The new york times annotated corpus}, author={Sandhaus, Evan}, journal={Linguistic Data Consortium, Philadelphia}, volume={6}, number={12}, pages={e26752}, year={2008} }