GitHub: datasets/nfcorpus.py

ir_datasets: NFCorpus (NutritionFacts)

Index
  1. nfcorpus
  2. nfcorpus/dev
  3. nfcorpus/dev/nontopic
  4. nfcorpus/dev/video
  5. nfcorpus/test
  6. nfcorpus/test/nontopic
  7. nfcorpus/test/video
  8. nfcorpus/train
  9. nfcorpus/train/nontopic
  10. nfcorpus/train/video

"nfcorpus"

"NFCorpus is a full-text English retrieval data set for Medical Information Retrieval. It contains a total of 3,244 natural language queries (written in non-technical English, harvested from the NutritionFacts.org site) with 169,756 automatically extracted relevance judgments for 9,964 medical documents (written in a complex terminology-heavy language), mostly from PubMed."

docs

Language: en

Document type:
NfCorpusDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. abstract: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, abstract>

You can find more details about the Python API here.
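As a minimal sketch of how the four document fields might be consumed, the snippet below builds a doc_id lookup and a single text string per document. The NfCorpusDoc namedtuple defined here is a local stand-in that mirrors the fields listed above, and the sample records and the doc_text helper are hypothetical; with the real library, dataset.docs_iter() would supply the records.

```python
from collections import namedtuple

# Local stand-in mirroring the documented fields; the real type comes from ir_datasets.
NfCorpusDoc = namedtuple("NfCorpusDoc", ["doc_id", "url", "title", "abstract"])

# Hypothetical sample records, for illustration only.
sample_docs = [
    NfCorpusDoc("D-1", "http://example.org/1", "Title one", "Abstract one."),
    NfCorpusDoc("D-2", "http://example.org/2", "", "Abstract two."),
]


def doc_text(doc):
    """Join the non-empty title and abstract into one string for indexing."""
    return " ".join(part for part in (doc.title, doc.abstract) if part)


# Map each doc_id to its record so documents can be fetched by ID.
docs_by_id = {doc.doc_id: doc for doc in sample_docs}

for doc_id, doc in docs_by_id.items():
    print(doc_id, doc_text(doc))
```

Skipping empty fields avoids a leading space when a record has no title.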

CLI
ir_datasets export nfcorpus docs
[doc_id]    [url]    [title]    [abstract]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nfcorpus')
# Index nfcorpus
indexer = pt.IterDictIndexer('./indices/nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'abstract'])

You can find more details about PyTerrier indexing here.

Citation

ir_datasets.bib:

\cite{Boteva2016Nfcorpus}

Bibtex:

@inproceedings{Boteva2016Nfcorpus,
  title = "A Full-Text Learning to Rank Dataset for Medical Information Retrieval",
  author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler",
  booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})",
  location = "Padova, Italy",
  publisher = "Springer",
  year = 2016
}

"nfcorpus/dev"

Official dev set. Queries include both a title and a combined "all" text field (titles, descriptions, topics, transcripts, and comments).

queries

Language: en

Query type:
NfCorpusQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. all: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, all>

You can find more details about the Python API here.
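Because each query carries two text fields, code that consumes the queries usually has to pick one. The sketch below shows one way to do that. NfCorpusQuery is a local stand-in for the documented type, and the sample query and the queries_as_dict helper are hypothetical.

```python
from collections import namedtuple

# Local stand-in mirroring the documented fields; the real type comes from ir_datasets.
NfCorpusQuery = namedtuple("NfCorpusQuery", ["query_id", "title", "all"])

# Hypothetical sample query, for illustration only.
sample_queries = [
    NfCorpusQuery("Q-1", "short title", "short title plus a longer description and transcript"),
]


def queries_as_dict(queries, field="title"):
    """Return {query_id: text} using the chosen field ('title' or 'all')."""
    if field not in ("title", "all"):
        raise ValueError(f"unknown field: {field}")
    return {q.query_id: getattr(q, field) for q in queries}


print(queries_as_dict(sample_queries, "title"))
print(queries_as_dict(sample_queries, "all"))
```

The short title resembles a typical keyword query, while the "all" field gives much longer text; which one suits an experiment is a design choice left to the reader.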

CLI
ir_datasets export nfcorpus/dev queries
[query_id]    [title]    [all]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/dev')
index_ref = pt.IndexRef.of('./indices/nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from nfcorpus

Document type:
NfCorpusDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. abstract: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, abstract>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/dev docs
[doc_id]    [url]    [title]    [abstract]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/dev')
# Index nfcorpus
indexer = pt.IterDictIndexer('./indices/nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition
0       Marginally relevant, based on topic containment.
1       A link exists from the query to another query that directly links to the document.
2       A direct link from the query to the document in the cited sources section of a page.
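
To illustrate how the graded levels above can feed a graded metric, here is a sketch of nDCG over a toy ranking. It assumes the linear-gain formulation (gain equals the relevance level, discount log2(rank + 1)), which is one common convention; evaluation tools may use a different gain function. The judgments and rankings are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_doc_ids, judgments, k):
    """nDCG@k for one query; unjudged documents count as gain 0."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments for one query: doc_id -> relevance level.
judgments = {"D-1": 2, "D-2": 1, "D-3": 0}

print(ndcg_at_k(["D-1", "D-2", "D-3"], judgments, k=3))  # ranking in ideal order
print(ndcg_at_k(["D-3", "D-2", "D-1"], judgments, k=3))  # ranking in reverse order
```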

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
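Evaluation code often needs the judgments grouped by query rather than as a flat stream. The sketch below shows one way to do the grouping. TrecQrel is a local stand-in for the documented type, and the sample judgments and the group_qrels helper are hypothetical.

```python
from collections import defaultdict, namedtuple

# Local stand-in mirroring the documented fields; the real type comes from ir_datasets.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])

# Hypothetical sample judgments, for illustration only.
sample_qrels = [
    TrecQrel("Q-1", "D-1", 2, "0"),
    TrecQrel("Q-1", "D-2", 1, "0"),
    TrecQrel("Q-2", "D-1", 0, "0"),
]


def group_qrels(qrels):
    """Return {query_id: {doc_id: relevance}} from a flat iterable of qrels."""
    grouped = defaultdict(dict)
    for qrel in qrels:
        grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


print(group_qrels(sample_qrels))
```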

CLI
ir_datasets export nfcorpus/dev qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/dev')
index_ref = pt.IndexRef.of('./indices/nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Boteva2016Nfcorpus}

Bibtex:

@inproceedings{Boteva2016Nfcorpus,
  title = "A Full-Text Learning to Rank Dataset for Medical Information Retrieval",
  author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler",
  booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})",
  location = "Padova, Italy",
  publisher = "Springer",
  year = 2016
}

"nfcorpus/dev/nontopic"

Official dev set, filtered to exclude queries from topic pages.

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/dev/nontopic")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/dev/nontopic queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/dev/nontopic')
index_ref = pt.IndexRef.of('./indices/nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from nfcorpus

Document type:
NfCorpusDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. abstract: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/dev/nontopic")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, abstract>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/dev/nontopic docs
[doc_id]    [url]    [title]    [abstract]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/dev/nontopic')
# Index nfcorpus
indexer = pt.IterDictIndexer('./indices/nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition
0       Marginally relevant, based on topic containment.
1       A link exists from the query to another query that directly links to the document.
2       A direct link from the query to the document in the cited sources section of a page.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/dev/nontopic")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/dev/nontopic qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.
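If the exported judgments are saved to a file, they can be read back with the standard csv module. This sketch assumes the export is tab-separated with the four columns in the order shown above. The sample content and the read_qrels_tsv helper are hypothetical.

```python
import csv
import io

# Hypothetical export content; real output would come from the CLI command above.
exported = "Q-1\tD-1\t2\t0\nQ-1\tD-2\t1\t0\n"


def read_qrels_tsv(stream):
    """Yield (query_id, doc_id, relevance, iteration) tuples from a TSV stream."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip blank or malformed lines
        query_id, doc_id, relevance, iteration = row
        yield query_id, doc_id, int(relevance), iteration


rows = list(read_qrels_tsv(io.StringIO(exported)))
print(rows)
```

Passing quoting=csv.QUOTE_NONE keeps any quote characters in the data as literal text instead of treating them as field delimiters.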

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/dev/nontopic')
index_ref = pt.IndexRef.of('./indices/nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Boteva2016Nfcorpus}

Bibtex:

@inproceedings{Boteva2016Nfcorpus,
  title = "A Full-Text Learning to Rank Dataset for Medical Information Retrieval",
  author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler",
  booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})",
  location = "Padova, Italy",
  publisher = "Springer",
  year = 2016
}

"nfcorpus/dev/video"

Official dev set, filtered to only include queries from video pages.

queries

Language: en

Query type:
NfCorpusVideoQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. desc: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/dev/video")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, desc>

You can find more details about the Python API here.
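Video queries have a title and a longer description, and free text like this can contain punctuation that some query parsers may reject. The sketch below shows one possible way to turn a query into a plain keyword string. NfCorpusVideoQuery is a local stand-in for the documented type, and the sample query and the to_plain_query helper are hypothetical. Whether such cleaning is needed depends on the retrieval system in use.

```python
import re
from collections import namedtuple

# Local stand-in mirroring the documented fields; the real type comes from ir_datasets.
NfCorpusVideoQuery = namedtuple("NfCorpusVideoQuery", ["query_id", "title", "desc"])


def to_plain_query(query, use_desc=False):
    """Build a query string from the title (optionally plus desc) without punctuation."""
    text = f"{query.title} {query.desc}" if use_desc else query.title
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    return " ".join(text.split())  # collapse runs of whitespace


# Hypothetical sample query, for illustration only.
sample = NfCorpusVideoQuery("Q-1", "What's in a burger?", "Fast-food analysis: results & notes.")

print(to_plain_query(sample))
print(to_plain_query(sample, use_desc=True))
```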

CLI
ir_datasets export nfcorpus/dev/video queries
[query_id]    [title]    [desc]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/dev/video')
index_ref = pt.IndexRef.of('./indices/nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from nfcorpus

Document type:
NfCorpusDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. abstract: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/dev/video")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, abstract>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/dev/video docs
[doc_id]    [url]    [title]    [abstract]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/dev/video')
# Index nfcorpus
indexer = pt.IterDictIndexer('./indices/nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition
0       Marginally relevant, based on topic containment.
1       A link exists from the query to another query that directly links to the document.
2       A direct link from the query to the document in the cited sources section of a page.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/dev/video")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/dev/video qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/dev/video')
index_ref = pt.IndexRef.of('./indices/nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Boteva2016Nfcorpus}

Bibtex:

@inproceedings{Boteva2016Nfcorpus,
  title = "A Full-Text Learning to Rank Dataset for Medical Information Retrieval",
  author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler",
  booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})",
  location = "Padova, Italy",
  publisher = "Springer",
  year = 2016
}

"nfcorpus/test"

Official test set. Queries include both a title and a combined "all" text field (titles, descriptions, topics, transcripts, and comments).

queries

Language: en

Query type:
NfCorpusQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. all: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, all>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/test queries
[query_id]    [title]    [all]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/test')
index_ref = pt.IndexRef.of('./indices/nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from nfcorpus

Document type:
NfCorpusDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. abstract: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, abstract>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/test docs
[doc_id]    [url]    [title]    [abstract]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/test')
# Index nfcorpus
indexer = pt.IterDictIndexer('./indices/nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition
0       Marginally relevant, based on topic containment.
1       A link exists from the query to another query that directly links to the document.
2       A direct link from the query to the document in the cited sources section of a page.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/test qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/test')
index_ref = pt.IndexRef.of('./indices/nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.
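Binary measures such as MAP need a rule for which graded levels count as relevant, and that matters here because level 0 is still defined as marginally relevant. The sketch below computes average precision for one query with an explicit threshold. The default of min_rel=1 is an assumption for illustration and worth checking against the evaluation tool in use. The judgments and ranking are hypothetical.

```python
def average_precision(ranked_doc_ids, judgments, min_rel=1):
    """AP for one query; a document counts as relevant if its level >= min_rel."""
    relevant = {doc_id for doc_id, rel in judgments.items() if rel >= min_rel}
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant)


# Hypothetical judgments and ranking for one query.
judgments = {"D-1": 2, "D-2": 1, "D-3": 0}
ranking = ["D-3", "D-1", "D-2"]

print(average_precision(ranking, judgments))             # level 0 treated as non-relevant
print(average_precision(ranking, judgments, min_rel=0))  # every judged document counts
```

The two calls give different scores for the same ranking, which is why the threshold should be reported alongside any binary measure on this collection.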

Citation

ir_datasets.bib:

\cite{Boteva2016Nfcorpus}

Bibtex:

@inproceedings{Boteva2016Nfcorpus,
  title = "A Full-Text Learning to Rank Dataset for Medical Information Retrieval",
  author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler",
  booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})",
  location = "Padova, Italy",
  publisher = "Springer",
  year = 2016
}

"nfcorpus/test/nontopic"

Official test set, filtered to exclude queries from topic pages.

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/test/nontopic")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/test/nontopic queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/test/nontopic')
index_ref = pt.IndexRef.of('./indices/nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from nfcorpus

Document type:
NfCorpusDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. abstract: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/test/nontopic")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, abstract>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/test/nontopic docs
[doc_id]    [url]    [title]    [abstract]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/test/nontopic')
# Index nfcorpus
indexer = pt.IterDictIndexer('./indices/nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition
0       Marginally relevant, based on topic containment.
1       A link exists from the query to another query that directly links to the document.
2       A direct link from the query to the document in the cited sources section of a page.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/test/nontopic")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/test/nontopic qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/test/nontopic')
index_ref = pt.IndexRef.of('./indices/nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Boteva2016Nfcorpus}

Bibtex:

@inproceedings{Boteva2016Nfcorpus,
  title = "A Full-Text Learning to Rank Dataset for Medical Information Retrieval",
  author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler",
  booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})",
  location = "Padova, Italy",
  publisher = "Springer",
  year = 2016
}

"nfcorpus/test/video"

Official test set, filtered to only include queries from video pages.

queries

Language: en

Query type:
NfCorpusVideoQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. desc: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/test/video")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, desc>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/test/video queries
[query_id]    [title]    [desc]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/test/video')
index_ref = pt.IndexRef.of('./indices/nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from nfcorpus

Document type:
NfCorpusDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. abstract: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/test/video")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, abstract>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/test/video docs
[doc_id]    [url]    [title]    [abstract]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/test/video')
# Index nfcorpus
indexer = pt.IterDictIndexer('./indices/nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition
0       Marginally relevant, based on topic containment.
1       A link exists from the query to another query that directly links to the document.
2       A direct link from the query to the document in the cited sources section of a page.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/test/video")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/test/video qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/test/video')
index_ref = pt.IndexRef.of('./indices/nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Boteva2016Nfcorpus}

Bibtex:

@inproceedings{Boteva2016Nfcorpus,
  title = "A Full-Text Learning to Rank Dataset for Medical Information Retrieval",
  author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler",
  booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})",
  location = "Padova, Italy",
  publisher = "Springer",
  year = 2016
}

"nfcorpus/train"

Official train set. Queries include both a title and a combined "all" text field (titles, descriptions, topics, transcripts, and comments).

queries

Language: en

Query type:
NfCorpusQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. all: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, all>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/train queries
[query_id]    [title]    [all]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/train')
index_ref = pt.IndexRef.of('./indices/nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from nfcorpus

Document type:
NfCorpusDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. abstract: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, abstract>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/train docs
[doc_id]    [url]    [title]    [abstract]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/train')
# Index nfcorpus
indexer = pt.IterDictIndexer('./indices/nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition
0       Marginally relevant, based on topic containment.
1       A link exists from the query to another query that directly links to the document.
2       A direct link from the query to the document in the cited sources section of a page.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
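Since the training judgments are graded, one possible way to use them for learning to rank is to derive pairwise preferences per query. The sketch below is an illustration under that assumption, not the procedure used by the dataset authors. The judgments and the preference_pairs helper are hypothetical.

```python
from itertools import combinations


def preference_pairs(judgments):
    """Return (preferred_doc_id, other_doc_id) pairs for one query from graded levels."""
    pairs = []
    # Sort for a deterministic order, then compare every pair of judged documents.
    for (doc_a, rel_a), (doc_b, rel_b) in combinations(sorted(judgments.items()), 2):
        if rel_a > rel_b:
            pairs.append((doc_a, doc_b))
        elif rel_b > rel_a:
            pairs.append((doc_b, doc_a))
        # Documents with equal levels express no preference and are skipped.
    return pairs


# Hypothetical judgments for one query: doc_id -> relevance level.
judgments = {"D-1": 2, "D-2": 1, "D-3": 1}

print(preference_pairs(judgments))
```

Only documents judged for the same query are compared; unjudged documents would need a separate sampling step that is not shown here.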

CLI
ir_datasets export nfcorpus/train qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/train')
index_ref = pt.IndexRef.of('./indices/nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Boteva2016Nfcorpus}

Bibtex:

@inproceedings{Boteva2016Nfcorpus,
  title = "A Full-Text Learning to Rank Dataset for Medical Information Retrieval",
  author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler",
  booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})",
  location = "Padova, Italy",
  publisher = "Springer",
  year = 2016
}

"nfcorpus/train/nontopic"

Official train set, filtered to exclude queries from topic pages.

queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/train/nontopic")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/train/nontopic queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/train/nontopic')
index_ref = pt.IndexRef.of('./indices/nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from nfcorpus

Document type:
NfCorpusDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. abstract: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/train/nontopic")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, abstract>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/train/nontopic docs
[doc_id]    [url]    [title]    [abstract]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/train/nontopic')
# Index nfcorpus
indexer = pt.IterDictIndexer('./indices/nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition
0       Marginally relevant, based on topic containment.
1       A link exists from the query to another query that directly links to the document.
2       A direct link from the query to the document in the cited sources section of a page.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/train/nontopic")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/train/nontopic qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/train/nontopic')
index_ref = pt.IndexRef.of('./indices/nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Boteva2016Nfcorpus}

Bibtex:

@inproceedings{Boteva2016Nfcorpus,
  title = "A Full-Text Learning to Rank Dataset for Medical Information Retrieval",
  author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler",
  booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})",
  location = "Padova, Italy",
  publisher = "Springer",
  year = 2016
}

"nfcorpus/train/video"

Official train set, filtered to only include queries from video pages.

queries

Language: en

Query type:
NfCorpusVideoQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. desc: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/train/video")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, desc>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/train/video queries
[query_id]    [title]    [desc]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/train/video')
index_ref = pt.IndexRef.of('./indices/nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs

Language: en

Note: Uses docs from nfcorpus

Document type:
NfCorpusDoc: (namedtuple)
  1. doc_id: str
  2. url: str
  3. title: str
  4. abstract: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/train/video")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, url, title, abstract>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/train/video docs
[doc_id]    [url]    [title]    [abstract]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/train/video')
# Index nfcorpus
indexer = pt.IterDictIndexer('./indices/nfcorpus')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'title', 'abstract'])

You can find more details about PyTerrier indexing here.

qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition
0       Marginally relevant, based on topic containment.
1       A link exists from the query to another query that directly links to the document.
2       A direct link from the query to the document in the cited sources section of a page.

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("nfcorpus/train/video")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export nfcorpus/train/video qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:nfcorpus/train/video')
index_ref = pt.IndexRef.of('./indices/nfcorpus') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

Citation

ir_datasets.bib:

\cite{Boteva2016Nfcorpus}

Bibtex:

@inproceedings{Boteva2016Nfcorpus,
  title = "A Full-Text Learning to Rank Dataset for Medical Information Retrieval",
  author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler",
  booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})",
  location = "Padova, Italy",
  publisher = "Springer",
  year = 2016
}