ir_datasets: NYT

To use this dataset, you need a copy of the source corpus, provided by the Linguistic Data Consortium (LDC). The specific resource needed is LDC2008T19.
Many organizations already have a subscription to the LDC, so access to the collection can be as easy as confirming the data usage agreement and downloading the corpus. Check with your library for access details.
The source file is: nyt_corpus_LDC2008T19.tgz.
ir_datasets expects this file to be copied/linked as ~/.ir_datasets/nyt/nyt.tgz.
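A small sketch for putting the archive in place. The source path below is a guess at where your LDC download might live; adjust it to your setup. Symlinking rather than copying avoids keeping a second copy of the large archive on disk.

```python
from pathlib import Path

def link_corpus(source: Path, target: Path) -> bool:
    """Symlink the LDC archive into the location ir_datasets expects.

    Returns True if the link is in place, False if the source is missing.
    """
    if not source.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        target.symlink_to(source)  # link instead of copy to save disk space
    return True

# Hypothetical download location; point `source` at your actual copy.
link_corpus(Path("~/Downloads/nyt_corpus_LDC2008T19.tgz").expanduser(),
            Path("~/.ir_datasets/nyt/nyt.tgz").expanduser())
```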
The New York Times Annotated Corpus. Consists of articles published between 1987 and 2007. It was used in the TREC Common Core 2017 track and is also useful for transferring relevance signals in cases where training data is in short supply.
Uses data from LDC2008T19. The source collection can be downloaded from the LDC.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nyt")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, headline, body, source_xml>
You can find more details about the Python API here.
ir_datasets export nyt docs
[doc_id] [headline] [body] [source_xml]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nyt')
# Index nyt
indexer = pt.IterDictIndexer('./indices/nyt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['headline', 'body'])
You can find more details about PyTerrier indexing here.
Bibtex:
@article{Sandhaus2008Nyt,
  title={The new york times annotated corpus},
  author={Sandhaus, Evan},
  journal={Linguistic Data Consortium, Philadelphia},
  volume={6},
  number={12},
  pages={e26752},
  year={2008}
}

Metadata:
{"docs": {"count": 1864661, "fields": {"doc_id": {"max_len": 7, "common_prefix": ""}}}}
The TREC Common Core 2017 benchmark.
Note that this dataset only contains the 50 queries assessed by NIST.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nyt/trec-core-2017")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, title, description, narrative>
You can find more details about the Python API here.
ir_datasets export nyt/trec-core-2017 queries
[query_id] [title] [description] [narrative]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nyt/trec-core-2017')
index_ref = pt.IndexRef.of('./indices/nyt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))
You can find more details about PyTerrier retrieval here.
Inherits docs from nyt
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nyt/trec-core-2017")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, headline, body, source_xml>
You can find more details about the Python API here.
ir_datasets export nyt/trec-core-2017 docs
[doc_id] [headline] [body] [source_xml]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nyt/trec-core-2017')
# Index nyt
indexer = pt.IterDictIndexer('./indices/nyt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['headline', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
0 | not relevant | 21K | 70.0% |
1 | relevant | 5.5K | 18.5% |
2 | highly relevant | 3.5K | 11.5% |
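The percentages in this table follow directly from the per-level qrel counts (21028 / 5549 / 3453, as listed in the dataset metadata). A quick sketch of the arithmetic:

```python
# Reproduce the relevance-level breakdown from the raw qrel counts.
counts = {0: 21028, 1: 5549, 2: 3453}
total = sum(counts.values())  # 30030 qrels in all
for rel, n in sorted(counts.items()):
    print(f"rel={rel}: {n} ({100 * n / total:.1f}%)")
```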
Examples:
import ir_datasets
dataset = ir_datasets.load("nyt/trec-core-2017")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export nyt/trec-core-2017 qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
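The TSV export above emits one qrel per line in the field order shown. A minimal parser for that output, using hypothetical sample lines (the doc_ids are invented for illustration):

```python
from collections import namedtuple

# Field order matches the `ir_datasets export ... qrels --format tsv` output.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])

def parse_qrels_tsv(lines):
    """Yield one TrecQrel per tab-separated line, coercing relevance to int."""
    for line in lines:
        query_id, doc_id, relevance, iteration = line.rstrip("\n").split("\t")
        yield TrecQrel(query_id, doc_id, int(relevance), iteration)

# Hypothetical sample lines in the exported format:
sample = ["307\t0553585\t1\t0\n", "307\t0554353\t0\t0\n"]
for qrel in parse_qrels_tsv(sample):
    print(qrel)
```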
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nyt/trec-core-2017')
index_ref = pt.IndexRef.of('./indices/nyt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{Allan2017TrecCore,
  author = {James Allan and Donna Harman and Evangelos Kanoulas and Dan Li and Christophe Van Gysel and Ellen Voorhees},
  title = {TREC 2017 Common Core Track Overview},
  booktitle = {TREC},
  year = {2017}
}

@article{Sandhaus2008Nyt,
  title={The new york times annotated corpus},
  author={Sandhaus, Evan},
  journal={Linguistic Data Consortium, Philadelphia},
  volume={6},
  number={12},
  pages={e26752},
  year={2008}
}

Metadata:
{"docs": {"count": 1864661, "fields": {"doc_id": {"max_len": 7, "common_prefix": ""}}}, "queries": {"count": 50}, "qrels": {"count": 30030, "fields": {"relevance": {"counts_by_value": {"0": 21028, "1": 5549, "2": 3453}}}}}
Weak-supervision dataset for transferring relevance signals from the NYT corpus: each article's headline serves as a query for its body. Split into nyt/wksup/train and the held-out nyt/wksup/valid.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nyt/wksup")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export nyt/wksup queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nyt/wksup')
index_ref = pt.IndexRef.of('./indices/nyt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from nyt
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nyt/wksup")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, headline, body, source_xml>
You can find more details about the Python API here.
ir_datasets export nyt/wksup docs
[doc_id] [headline] [body] [source_xml]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nyt/wksup')
# Index nyt
indexer = pt.IterDictIndexer('./indices/nyt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['headline', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | title is associated with article body | 1.9M | 100.0% |
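The single relevance level reflects how this dataset is constructed: each headline is a pseudo-query whose own article body is the one relevant document. A toy sketch of deriving such query/qrel pairs from a corpus, using invented example articles and hypothetical Doc/Query/Qrel tuples:

```python
from collections import namedtuple

Doc = namedtuple("Doc", ["doc_id", "headline", "body"])
Query = namedtuple("Query", ["query_id", "text"])
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance"])

def weak_supervision_pairs(docs):
    """Treat each headline as a query that is relevant (only) to its own body."""
    for doc in docs:
        yield Query(doc.doc_id, doc.headline), Qrel(doc.doc_id, doc.doc_id, 1)

# Toy documents (not real NYT articles):
docs = [Doc("0000001", "Markets Rally on Jobs Report", "Stocks rose sharply ..."),
        Doc("0000002", "City Council Approves Budget", "After a long debate ...")]
for query, qrel in weak_supervision_pairs(docs):
    print(query.text, "->", qrel.doc_id)
```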
Examples:
import ir_datasets
dataset = ir_datasets.load("nyt/wksup")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export nyt/wksup qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nyt/wksup')
index_ref = pt.IndexRef.of('./indices/nyt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{MacAvaney2019Wksup,
  author = {MacAvaney, Sean and Yates, Andrew and Hui, Kai and Frieder, Ophir},
  title = {Content-Based Weak Supervision for Ad-Hoc Re-Ranking},
  booktitle = {SIGIR},
  year = {2019}
}

@article{Sandhaus2008Nyt,
  title={The new york times annotated corpus},
  author={Sandhaus, Evan},
  journal={Linguistic Data Consortium, Philadelphia},
  volume={6},
  number={12},
  pages={e26752},
  year={2008}
}

Metadata:
{"docs": {"count": 1864661, "fields": {"doc_id": {"max_len": 7, "common_prefix": ""}}}, "queries": {"count": 1864661}, "qrels": {"count": 1864661, "fields": {"relevance": {"counts_by_value": {"1": 1864661}}}}}
Training set (without held-out nyt/wksup/valid) for transferring relevance signals from NYT corpus.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nyt/wksup/train")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export nyt/wksup/train queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nyt/wksup/train')
index_ref = pt.IndexRef.of('./indices/nyt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from nyt
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nyt/wksup/train")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, headline, body, source_xml>
You can find more details about the Python API here.
ir_datasets export nyt/wksup/train docs
[doc_id] [headline] [body] [source_xml]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nyt/wksup/train')
# Index nyt
indexer = pt.IterDictIndexer('./indices/nyt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['headline', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | title is associated with article body | 1.9M | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("nyt/wksup/train")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export nyt/wksup/train qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nyt/wksup/train')
index_ref = pt.IndexRef.of('./indices/nyt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{MacAvaney2019Wksup,
  author = {MacAvaney, Sean and Yates, Andrew and Hui, Kai and Frieder, Ophir},
  title = {Content-Based Weak Supervision for Ad-Hoc Re-Ranking},
  booktitle = {SIGIR},
  year = {2019}
}

@article{Sandhaus2008Nyt,
  title={The new york times annotated corpus},
  author={Sandhaus, Evan},
  journal={Linguistic Data Consortium, Philadelphia},
  volume={6},
  number={12},
  pages={e26752},
  year={2008}
}

Metadata:
{"docs": {"count": 1864661, "fields": {"doc_id": {"max_len": 7, "common_prefix": ""}}}, "queries": {"count": 1863657}, "qrels": {"count": 1863657, "fields": {"relevance": {"counts_by_value": {"1": 1863657}}}}}
Held-out validation set for transferring relevance signals from NYT corpus (see nyt/wksup/train).
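Holding out a validation split from weak-supervision pairs can be sketched as a seeded random partition of the query ids. The function below is illustrative only; the actual train/valid split ships with the dataset, and the sizes here are made up:

```python
import random

def train_valid_split(ids, valid_size, seed=42):
    """Hold out `valid_size` ids for validation; the rest form the training set.

    Seeding the shuffle makes the split reproducible across runs.
    """
    rng = random.Random(seed)
    ids = list(ids)
    rng.shuffle(ids)
    return ids[valid_size:], ids[:valid_size]

# Toy example: 100 ids, 10 held out for validation.
train, valid = train_valid_split(range(100), valid_size=10)
```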
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nyt/wksup/valid")
for query in dataset.queries_iter():
    query  # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export nyt/wksup/valid queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nyt/wksup/valid')
index_ref = pt.IndexRef.of('./indices/nyt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
Inherits docs from nyt
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("nyt/wksup/valid")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, headline, body, source_xml>
You can find more details about the Python API here.
ir_datasets export nyt/wksup/valid docs
[doc_id] [headline] [body] [source_xml]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nyt/wksup/valid')
# Index nyt
indexer = pt.IterDictIndexer('./indices/nyt')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['headline', 'body'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition | Count | % |
---|---|---|---|
1 | title is associated with article body | 1.0K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("nyt/wksup/valid")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance>
You can find more details about the Python API here.
ir_datasets export nyt/wksup/valid qrels --format tsv
[query_id] [doc_id] [relevance]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nyt/wksup/valid')
index_ref = pt.IndexRef.of('./indices/nyt') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Bibtex:
@inproceedings{MacAvaney2019Wksup,
  author = {MacAvaney, Sean and Yates, Andrew and Hui, Kai and Frieder, Ophir},
  title = {Content-Based Weak Supervision for Ad-Hoc Re-Ranking},
  booktitle = {SIGIR},
  year = {2019}
}

@article{Sandhaus2008Nyt,
  title={The new york times annotated corpus},
  author={Sandhaus, Evan},
  journal={Linguistic Data Consortium, Philadelphia},
  volume={6},
  number={12},
  pages={e26752},
  year={2008}
}

Metadata:
{"docs": {"count": 1864661, "fields": {"doc_id": {"max_len": 7, "common_prefix": ""}}}, "queries": {"count": 1004}, "qrels": {"count": 1004, "fields": {"relevance": {"counts_by_value": {"1": 1004}}}}}