GitHub: datasets/highwire.py

ir_datasets: Highwire (TREC Genomics 2006-07)

Index
  1. highwire
  2. highwire/trec-genomics-2006
  3. highwire/trec-genomics-2007

"highwire"

Medical document collection from Highwire Press. Includes 162,259 scientific articles from 49 journals.

This dataset is used for the TREC Genomics Track 2006-07.

Note that each document is split into passages (the spans field below) based on paragraph tags in the HTML.

docs
162K docs

Language: en

Document type:
HighwireDoc: (namedtuple)
  1. doc_id: str
  2. journal: str
  3. title: str
  4. spans: Tuple[
    HighwireSpan: (namedtuple)
    1. start: int
    2. length: int
    3. text: str
    , ...]
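
To make the nested record above concrete, here is a minimal sketch of walking a document's spans and emitting one passage per span. The two classes are local stand-ins that only mirror the documented field names, so the snippet runs without the dataset installed. The passage ID format and the sample document are hypothetical, and the unit of start and length is not stated on this page.

```python
from typing import Iterator, NamedTuple, Tuple


class HighwireSpan(NamedTuple):
    """Local stand-in mirroring the documented span fields."""
    start: int
    length: int
    text: str


class HighwireDoc(NamedTuple):
    """Local stand-in mirroring the documented document fields."""
    doc_id: str
    journal: str
    title: str
    spans: Tuple[HighwireSpan, ...]


def iter_passages(doc: HighwireDoc) -> Iterator[Tuple[str, str]]:
    """Yield (passage_id, text) pairs, one per span.

    The "<doc_id>:<start>:<length>" ID format is a hypothetical convention
    chosen for this sketch, not something the dataset defines.
    """
    for span in doc.spans:
        yield f"{doc.doc_id}:{span.start}:{span.length}", span.text


# Made-up document used only to exercise the helper.
example = HighwireDoc(
    doc_id="12345",
    journal="Example Journal",
    title="An example title",
    spans=(
        HighwireSpan(start=0, length=11, text="First para."),
        HighwireSpan(start=20, length=12, text="Second para."),
    ),
)

passages = list(iter_passages(example))
```

Because the helper only relies on the documented attribute names, it should also accept documents yielded by docs_iter().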

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("highwire")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, journal, title, spans>

More details about the Python API are available in the ir_datasets documentation.
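
With roughly 162K documents, it can be convenient to inspect a handful before iterating over everything. The sketch below wraps docs_iter() with itertools.islice from the standard library. The helper name peek_docs is hypothetical, and the commented usage assumes ir_datasets is installed and the Highwire data is available locally.

```python
from itertools import islice


def peek_docs(dataset, n=3):
    """Return the first n documents from any object exposing docs_iter()."""
    return list(islice(dataset.docs_iter(), n))


# Usage against the real collection (not executed here):
#   import ir_datasets
#   for doc in peek_docs(ir_datasets.load("highwire")):
#       print(doc.doc_id, doc.journal, doc.title, len(doc.spans))
```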


"highwire/trec-genomics-2006"

The TREC Genomics Track 2006 benchmark. Contains 28 queries with passage-level relevance judgments.

queries
28 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("highwire/trec-genomics-2006")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

More details about the Python API are available in the ir_datasets documentation.
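
Since the relevance judgments are passage-level, one common step is mapping a judged region back onto a document's spans. The qrels record type is not shown in this section, so the sketch below rests on an assumption: that each judgment carries a start and length in the same offset space as the spans. The Span class is a local stand-in, and all sample values are made up.

```python
from typing import List, NamedTuple, Sequence


class Span(NamedTuple):
    """Local stand-in mirroring the documented HighwireSpan fields."""
    start: int
    length: int
    text: str


def overlapping_spans(spans: Sequence[Span], rel_start: int, rel_length: int) -> List[Span]:
    """Return spans whose [start, start + length) range intersects the judged range."""
    rel_end = rel_start + rel_length
    return [
        s for s in spans
        if s.start < rel_end and rel_start < s.start + s.length
    ]


# Made-up spans and a made-up judged region covering offsets [200, 320).
spans = [Span(0, 100, "intro"), Span(150, 80, "methods"), Span(300, 50, "results")]
hits = overlapping_spans(spans, rel_start=200, rel_length=120)
```

Half-open intervals keep the overlap test to two comparisons and treat spans that merely touch the judged region as non-overlapping.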


"highwire/trec-genomics-2007"

The TREC Genomics Track 2007 benchmark. Contains 36 queries with passage-level relevance judgments.

queries
36 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("highwire/trec-genomics-2007")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

More details about the Python API are available in the ir_datasets documentation.