ir_datasets: Highwire (TREC Genomics 2006-07)
Medical document collection from Highwire Press. Includes 162,259 scientific articles from 49 journals.
This dataset was used for the TREC Genomics track in 2006 and 2007.
Note that these documents are split into passages based on paragraph tags in the HTML.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("highwire")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, journal, title, spans>
You can find more details about the Python API here.
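The note about splitting documents into passages on paragraph tags can be illustrated with a short sketch. This is not the code ir_datasets uses; it is a minimal stand-in built on Python's standard `html.parser`, and the names `ParagraphSplitter` and `split_passages` are hypothetical. It only shows the general idea of turning each `<p>` element into one passage of text.

```python
from html.parser import HTMLParser
from typing import List


class ParagraphSplitter(HTMLParser):
    """Collects the text inside each <p> element (illustrative only)."""

    def __init__(self) -> None:
        super().__init__()
        self._in_paragraph = False
        self._buffer: List[str] = []
        self.paragraphs: List[str] = []

    def _flush(self) -> None:
        # Join the buffered pieces and normalize whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # A new <p> implicitly ends any paragraph still open.
            self._flush()
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_paragraph = False

    def handle_data(self, data):
        # Text outside <p> elements (headings, etc.) is ignored.
        if self._in_paragraph:
            self._buffer.append(data)


def split_passages(html: str) -> List[str]:
    """Return one text passage per <p> element in the HTML string."""
    parser = ParagraphSplitter()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a trailing paragraph that was never closed
    return parser.paragraphs


example = "<html><body><h1>T</h1><p>First <b>para</b>.</p><p>Second\n para.</p></body></html>"
passages = split_passages(example)
```

Inline markup such as `<b>` stays inside its passage, while text outside any paragraph is dropped. How the actual collection handles such cases is not stated here, so treat this behavior as an assumption of the sketch.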
The TREC Genomics Track 2006 benchmark. Contains 28 queries with passage-level relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("highwire/trec-genomics-2006")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
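Passage-level relevance judgments like those described for this benchmark could be grouped per query roughly as follows. This is a minimal sketch under stated assumptions: the `Qrel` record and its fields (`query_id`, `doc_id`, `start`, `length`, `relevance`) are hypothetical names chosen for illustration, and the sample records are made up purely to show the shape of the result. In practice the records would come from the dataset itself rather than a hand-written list.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Qrel(NamedTuple):
    # Hypothetical judgment record; the field names are assumptions.
    query_id: str
    doc_id: str
    start: int      # offset of the judged passage within the document
    length: int     # length of the judged passage
    relevance: int  # 0 = not relevant, greater than 0 = relevant


def relevant_passages(qrels: Iterable[Qrel]) -> Dict[str, List[Tuple[str, int, int]]]:
    """Map each query_id to its relevant (doc_id, start, length) passages."""
    by_query = defaultdict(list)
    for qrel in qrels:
        if qrel.relevance > 0:
            by_query[qrel.query_id].append((qrel.doc_id, qrel.start, qrel.length))
    return dict(by_query)


# Made-up records, only to show the shape of the result.
sample = [
    Qrel("q1", "docA", 100, 50, 1),
    Qrel("q1", "docB", 0, 20, 0),
    Qrel("q2", "docA", 300, 80, 1),
]
grouped = relevant_passages(sample)
```

Keying on the query first keeps the per-query lookup cheap when scoring a ranked list of passages one query at a time.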
The TREC Genomics Track 2007 benchmark. Contains 36 queries with passage-level relevance judgments.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("highwire/trec-genomics-2007")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
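When comparing a retrieved passage against a passage-level judgment, one useful building block is the number of characters the two passages share. The sketch below assumes both passages are given as (start, length) ranges within the same document; the function name `overlap_chars` is hypothetical, and this is an illustration rather than the official track measure.

```python
def overlap_chars(start_a: int, length_a: int, start_b: int, length_b: int) -> int:
    """Number of characters shared by two (start, length) ranges."""
    end_a = start_a + length_a
    end_b = start_b + length_b
    # The shared region runs from the later start to the earlier end;
    # a negative width means the ranges do not touch.
    return max(0, min(end_a, end_b) - max(start_a, start_b))


# A retrieved passage at 100..150 against a judged passage at 120..220.
shared = overlap_chars(100, 50, 120, 100)
```

Ranges that merely touch at a boundary share no characters under this definition, since each range is treated as half-open.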