GitHub: datasets/nfcorpus.py

ir_datasets: NFCorpus (NutritionFacts)

Index
  1. nfcorpus
  2. nfcorpus/dev
  3. nfcorpus/dev/nontopic
  4. nfcorpus/dev/video
  5. nfcorpus/test
  6. nfcorpus/test/nontopic
  7. nfcorpus/test/video
  8. nfcorpus/train
  9. nfcorpus/train/nontopic
  10. nfcorpus/train/video

"nfcorpus"

"NFCorpus is a full-text English retrieval data set for Medical Information Retrieval. It contains a total of 3,244 natural language queries (written in non-technical English, harvested from the NutritionFacts.org site) with 169,756 automatically extracted relevance judgments for 9,964 medical documents (written in a complex terminology-heavy language), mostly from PubMed."

docs

Language: en

Document type:
NfCorpusDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. abstract: str

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, abstract>

Citation

bibtex:
@inproceedings{boteva16full,
  title = "A Full-Text Learning to Rank Dataset for Medical Information Retrieval",
  author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler",
  booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})",
  location = "Padova, Italy",
  publisher = "Springer",
  year = 2016
}
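
Each document exposes its title and abstract as separate fields, so indexing code usually has to decide how to combine them. The following is a minimal sketch of one option (simple concatenation). The NfCorpusDoc stand-in, the sample records, and the doc_text helper are assumptions for illustration and are not part of ir_datasets; with the library installed, the loop would run over dataset.docs_iter() instead.

```python
from collections import namedtuple

# Stand-in mirroring the NfCorpusDoc fields listed above, so the sketch runs
# without downloading the corpus.
NfCorpusDoc = namedtuple("NfCorpusDoc", ["doc_id", "url", "title", "abstract"])

# Hypothetical records, not real corpus entries.
docs = [
    NfCorpusDoc("D1", "http://example.org/1", "Title one", "Abstract one."),
    NfCorpusDoc("D2", "http://example.org/2", "", "Abstract only."),
]


def doc_text(doc):
    """Join the non-empty title and abstract into one string (hypothetical helper)."""
    return " ".join(part for part in (doc.title, doc.abstract) if part)


# Map doc_id -> text, a common input shape for building an index.
texts = {doc.doc_id: doc_text(doc) for doc in docs}
```

Skipping empty fields avoids a leading space when a record has no title.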

"nfcorpus/dev"

Official dev set. Queries include both a title and a combined "all" text field (titles, descriptions, topics, transcripts, and comments).

queries

Language: en

Query type:
NfCorpusQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. all: str

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/dev')
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, all>

docs

Language: en

Document type:
NfCorpusDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. abstract: str

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/dev')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, abstract>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Marginally relevant, based on topic containment.
1     A link exists from the query to another query that directly links to the document.
2     A direct link from the query to the document in the cited sources section of a page.

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/dev')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
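
Because the judgments arrive as a flat stream of records, it is often convenient to group them per query and, optionally, keep only the stronger relevance levels. The sketch below shows one way to do this. The TrecQrel stand-in, the sample rows, and the group_qrels helper are assumptions for illustration; with ir_datasets installed, the input would be dataset.qrels_iter().

```python
from collections import defaultdict, namedtuple

# Stand-in mirroring the TrecQrel fields listed above.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])

# Hypothetical judgments, not real corpus entries.
qrels = [
    TrecQrel("Q1", "D1", 2, "0"),
    TrecQrel("Q1", "D2", 1, "0"),
    TrecQrel("Q1", "D3", 0, "0"),
    TrecQrel("Q2", "D1", 1, "0"),
]


def group_qrels(qrels, min_relevance=0):
    """Return {query_id: {doc_id: relevance}} for judgments at or above min_relevance."""
    grouped = defaultdict(dict)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


all_levels = group_qrels(qrels)                    # every judged pair
direct_only = group_qrels(qrels, min_relevance=2)  # direct links only
```

Queries with no judgment at or above the threshold are absent from the result, so callers should use .get() when looking them up.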

"nfcorpus/dev/nontopic"

Official dev set, filtered to exclude queries from topic pages.

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/dev/nontopic')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

docs

Language: en

Document type:
NfCorpusDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. abstract: str

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/dev/nontopic')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, abstract>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Marginally relevant, based on topic containment.
1     A link exists from the query to another query that directly links to the document.
2     A direct link from the query to the document in the cited sources section of a page.

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/dev/nontopic')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

"nfcorpus/dev/video"

Official dev set, filtered to only include queries from video pages.

queries

Language: en

Query type:
NfCorpusVideoQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. desc: str

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/dev/video')
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, desc>

docs

Language: en

Document type:
NfCorpusDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. abstract: str

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/dev/video')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, abstract>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Marginally relevant, based on topic containment.
1     A link exists from the query to another query that directly links to the document.
2     A direct link from the query to the document in the cited sources section of a page.

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/dev/video')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
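
The subsets of a split use three different query types: NfCorpusQuery (title, all), GenericQuery (text), and NfCorpusVideoQuery (title, desc). Code that should work across subsets therefore needs a way to pick one search string per query. The sketch below is one possible approach. The namedtuple stand-ins mirror the field lists on this page, while the query_text helper and its prefer_long parameter are hypothetical and not part of ir_datasets.

```python
from collections import namedtuple

# Stand-ins mirroring the three query types documented on this page.
GenericQuery = namedtuple("GenericQuery", ["query_id", "text"])
NfCorpusQuery = namedtuple("NfCorpusQuery", ["query_id", "title", "all"])
NfCorpusVideoQuery = namedtuple("NfCorpusVideoQuery", ["query_id", "title", "desc"])


def query_text(query, prefer_long=False):
    """Return a single search string for any of the three query types."""
    if hasattr(query, "text"):  # GenericQuery has only one text field
        return query.text
    if prefer_long:
        # NfCorpusQuery carries "all"; NfCorpusVideoQuery carries "desc".
        long_text = getattr(query, "all", None) or getattr(query, "desc", None)
        if long_text:
            return long_text
    return query.title  # short form, also the fallback when the long field is empty


# Hypothetical queries, not real corpus entries.
short = query_text(NfCorpusVideoQuery("Q1", "video title", "video description"))
long = query_text(NfCorpusVideoQuery("Q1", "video title", "video description"), prefer_long=True)
```

Falling back to the title keeps the helper from returning an empty string when the longer field is missing.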

"nfcorpus/test"

Official test set. Queries include both a title and a combined "all" text field (titles, descriptions, topics, transcripts, and comments).

queries

Language: en

Query type:
NfCorpusQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. all: str

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/test')
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, all>

docs

Language: en

Document type:
NfCorpusDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. abstract: str

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/test')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, abstract>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Marginally relevant, based on topic containment.
1     A link exists from the query to another query that directly links to the document.
2     A direct link from the query to the document in the cited sources section of a page.

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/test')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
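
Since the judgments are graded (0 to 2), a graded metric can make use of all three levels. This page does not prescribe a metric. The following is only a sketch of nDCG@k with linear gain, written under that assumption. The function names, the sample judgments, and the sample rankings are all hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))


def ndcg_at_k(ranked_doc_ids, judged, k=10):
    """nDCG@k for one query; judged maps doc_id -> relevance level."""
    gains = [judged.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(judged.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments for a single query.
judged = {"D1": 2, "D2": 1, "D3": 0}

perfect = ndcg_at_k(["D1", "D2", "D3"], judged)       # best possible ordering
reversed_run = ndcg_at_k(["D3", "D2", "D1"], judged)  # worst ordering of the same docs
```

With linear gain, a level-0 judgment ("marginally relevant") contributes nothing, exactly like an unjudged document. If marginal relevance should count, the levels would need to be shifted or remapped before scoring.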

"nfcorpus/test/nontopic"

Official test set, filtered to exclude queries from topic pages.

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/test/nontopic')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

docs

Language: en

Document type:
NfCorpusDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. abstract: str

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/test/nontopic')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, abstract>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Marginally relevant, based on topic containment.
1     A link exists from the query to another query that directly links to the document.
2     A direct link from the query to the document in the cited sources section of a page.

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/test/nontopic')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

"nfcorpus/test/video"

Official test set, filtered to only include queries from video pages.

queries

Language: en

Query type:
NfCorpusVideoQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. desc: str

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/test/video')
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, desc>

docs

Language: en

Document type:
NfCorpusDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. abstract: str

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/test/video')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, abstract>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Marginally relevant, based on topic containment.
1     A link exists from the query to another query that directly links to the document.
2     A direct link from the query to the document in the cited sources section of a page.

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/test/video')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

"nfcorpus/train"

Official train set. Queries include both a title and a combined "all" text field (titles, descriptions, topics, transcripts, and comments).

queries

Language: en

Query type:
NfCorpusQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. all: str

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/train')
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, all>

docs

Language: en

Document type:
NfCorpusDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. abstract: str

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/train')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, abstract>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Marginally relevant, based on topic containment.
1     A link exists from the query to another query that directly links to the document.
2     A direct link from the query to the document in the cited sources section of a page.

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/train')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
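
For training a ranker, the three iterators of the train split typically have to be joined into (query, document, label) examples. The sketch below illustrates one way to do that join in memory. The namedtuple stand-ins mirror the types on this page; the sample records and the training_examples helper are assumptions for illustration. Using the query title and the document abstract as the text pair is likewise just one possible choice.

```python
from collections import namedtuple

# Stand-ins mirroring the types documented for this subset.
NfCorpusQuery = namedtuple("NfCorpusQuery", ["query_id", "title", "all"])
NfCorpusDoc = namedtuple("NfCorpusDoc", ["doc_id", "url", "title", "abstract"])
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])

# Hypothetical records, not real corpus entries.
queries = [NfCorpusQuery("Q1", "query title", "longer combined text")]
docs = [
    NfCorpusDoc("D1", "http://example.org/1", "Doc one", "Abstract one."),
    NfCorpusDoc("D2", "http://example.org/2", "Doc two", "Abstract two."),
]
qrels = [
    TrecQrel("Q1", "D1", 2, "0"),
    TrecQrel("Q1", "D2", 0, "0"),
    TrecQrel("Q1", "D9", 1, "0"),  # refers to a document missing from docs
]


def training_examples(queries, docs, qrels):
    """Yield (query title, document abstract, relevance) for each resolvable judgment."""
    query_by_id = {query.query_id: query for query in queries}
    doc_by_id = {doc.doc_id: doc for doc in docs}
    for qrel in qrels:
        query = query_by_id.get(qrel.query_id)
        doc = doc_by_id.get(qrel.doc_id)
        if query is None or doc is None:
            continue  # skip judgments whose query or document is not loaded
        yield query.title, doc.abstract, qrel.relevance


examples = list(training_examples(queries, docs, qrels))
```

Building both lookup dictionaries up front keeps the join linear in the number of judgments.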

"nfcorpus/train/nontopic"

Official train set, filtered to exclude queries from topic pages.

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/train/nontopic')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

docs

Language: en

Document type:
NfCorpusDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. abstract: str

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/train/nontopic')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, abstract>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Marginally relevant, based on topic containment.
1     A link exists from the query to another query that directly links to the document.
2     A direct link from the query to the document in the cited sources section of a page.

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/train/nontopic')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

"nfcorpus/train/video"

Official train set, filtered to only include queries from video pages.

queries

Language: en

Query type:
NfCorpusVideoQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. desc: str

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/train/video')
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, desc>

docs

Language: en

Document type:
NfCorpusDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. abstract: str

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/train/video')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, abstract>

qrels

Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.  Definition
0     Marginally relevant, based on topic containment.
1     A link exists from the query to another query that directly links to the document.
2     A direct link from the query to the document in the cited sources section of a page.

Example

import ir_datasets
dataset = ir_datasets.load('nfcorpus/train/video')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>