ir_datasets: NYT
The New York Times Annotated Corpus. It consists of articles published between 1987 and 2007 and is useful for transferring relevance signals in cases where training data is in short supply.
Uses data from LDC2008T19.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('nyt')
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, metadata_html, headline, lead_paragraph_html, fulltext_html>
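The field names ending in `_html` suggest that those fields carry HTML markup rather than plain text. A minimal sketch of reducing such a field to plain text with Python's standard `html.parser` module follows. The `NytDoc` namedtuple, the `html_to_text` helper, and the sample record are hypothetical stand-ins that only mirror the field names shown above, so the snippet runs without access to the corpus.

```python
from collections import namedtuple
from html.parser import HTMLParser

# Hypothetical stand-in mirroring the field names of the doc namedtuple above.
NytDoc = namedtuple(
    "NytDoc",
    ["doc_id", "metadata_html", "headline", "lead_paragraph_html", "fulltext_html"],
)


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Return the text content of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Invented sample record for illustration; not taken from the corpus.
sample = NytDoc(
    doc_id="0000001",
    metadata_html="<meta name='example'/>",
    headline="Example Headline",
    lead_paragraph_html="<p>Lead paragraph.</p>",
    fulltext_html="<p>First <b>paragraph</b>.</p>\n<p>Second paragraph.</p>",
)

plain_text = html_to_text(sample.fulltext_html)
```

With the real dataset, the same helper could be applied to `doc.fulltext_html` inside the `docs_iter()` loop.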
Training set (excluding the held-out nyt/valid set) for transferring relevance signals from the NYT corpus.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('nyt/train')
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('nyt/train')
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, metadata_html, headline, lead_paragraph_html, fulltext_html>
Relevance levels
| Rel. | Definition |
|---|---|
| 1 | title is associated with article body |
Example
import ir_datasets
dataset = ir_datasets.load('nyt/train')
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance>
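For training it is often convenient to look up judgments by query rather than to stream them one at a time. The sketch below groups qrel records into a nested dictionary. The `Qrel` namedtuple, the `build_qrels_lookup` helper, and the sample judgments are hypothetical and only reuse the field names shown above.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in with the same fields as the qrel namedtuple above.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance"])


def build_qrels_lookup(qrels):
    """Map each query_id to a dict of {doc_id: relevance}."""
    lookup = defaultdict(dict)
    for qrel in qrels:
        lookup[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(lookup)


# Invented judgments for illustration only.
sample_qrels = [
    Qrel("q1", "d1", 1),
    Qrel("q2", "d2", 1),
    Qrel("q2", "d7", 1),
]

qrels_lookup = build_qrels_lookup(sample_qrels)
```

With the real dataset, `dataset.qrels_iter()` could be passed to the helper in place of `sample_qrels`.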
Held-out validation set for transferring relevance signals from the NYT corpus (see nyt/train).
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('nyt/valid')
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
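Since nyt/valid is described as held out from nyt/train, one possible sanity check is that the two query sets share no `query_id`. The following is a minimal sketch of that check. The `Query` namedtuple, the `shared_query_ids` helper, and the sample queries are hypothetical and only mirror the field names shown above.

```python
from collections import namedtuple

# Hypothetical stand-in with the same fields as the query namedtuple above.
Query = namedtuple("Query", ["query_id", "text"])


def shared_query_ids(train_queries, valid_queries):
    """Return the set of query_ids that appear in both query collections."""
    train_ids = {query.query_id for query in train_queries}
    return {query.query_id for query in valid_queries if query.query_id in train_ids}


# Invented queries for illustration only.
train_sample = [Query("q1", "first example title"), Query("q2", "second example title")]
valid_sample = [Query("q3", "third example title")]

overlap = shared_query_ids(train_sample, valid_sample)
```

An empty result is what a clean split would produce. With the real data, the two arguments would be the `queries_iter()` iterators of the nyt/train and nyt/valid datasets.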
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('nyt/valid')
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, metadata_html, headline, lead_paragraph_html, fulltext_html>
Relevance levels
| Rel. | Definition |
|---|---|
| 1 | title is associated with article body |
Example
import ir_datasets
dataset = ir_datasets.load('nyt/valid')
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance>