ir_datasets: DPR Wiki100
A Wikipedia dump from 20 December 2018, split into passages of 100 words. Used in the DPR paper (and in subsequent works) for retrieval experiments over Q&A collections.
Language: en
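To illustrate what "passages of 100 words" means, here is a minimal sketch that splits text into consecutive, non-overlapping chunks of whitespace-delimited words. The helper name `split_into_passages` and the sample text are hypothetical, and the actual DPR preprocessing (cleaning, tokenization, handling of the final short chunk) may well differ.

```python
def split_into_passages(text, words_per_passage=100):
    """Split text into consecutive chunks of at most `words_per_passage` words.

    Assumption: words are whitespace-delimited and chunks do not overlap.
    """
    words = text.split()
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]


# Made-up 250-word "article" for illustration only.
article = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(article)
print([len(p.split()) for p in passages])  # [100, 100, 50]
```

Under these assumptions, the last passage is simply shorter than the others.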
Examples:
import ir_datasets
dataset = ir_datasets.load("dpr-w100")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>
You can find more details about the Python API here.
ir_datasets export dpr-w100 docs
[doc_id]    [text]    [title]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:dpr-w100')
# Index dpr-w100
indexer = pt.IterDictIndexer('./indices/dpr-w100')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Dev subset from the Natural Questions Q&A collection. It differs from the natural-questions/dev dataset in that it uses the full Wikipedia dump and applies additional filtering (described in the DPR paper).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("dpr-w100/natural-questions/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, answers>
You can find more details about the Python API here.
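Each query carries an `answers` field alongside its text. As a rough sketch of how that field might be used, the snippet below checks whether a passage contains any of a query's answers. It assumes `answers` is a sequence of strings and uses stand-in namedtuples with made-up data; the case-insensitive substring match is a simplification, not the matching procedure of the DPR paper.

```python
from collections import namedtuple

# Stand-ins mirroring the fields shown above; real records come from
# queries_iter() and docs_iter().
Query = namedtuple("Query", ["query_id", "text", "answers"])
Doc = namedtuple("Doc", ["doc_id", "text", "title"])


def contains_answer(doc, query):
    """Return True if any answer string occurs in the passage text (case-insensitive)."""
    haystack = doc.text.lower()
    return any(answer.lower() in haystack for answer in query.answers)


# Made-up examples for illustration only.
query = Query("q1", "who wrote hamlet", ("William Shakespeare", "Shakespeare"))
doc_hit = Doc("d1", "Hamlet is a tragedy written by William Shakespeare.", "Hamlet")
doc_miss = Doc("d2", "Macbeth is set in Scotland.", "Macbeth")

print(contains_answer(doc_hit, query))   # True
print(contains_answer(doc_miss, query))  # False
```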
ir_datasets export dpr-w100/natural-questions/dev queries
[query_id]    [text]    [answers]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/natural-questions/dev')
index_ref = pt.IndexRef.of('./indices/dpr-w100') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from dpr-w100
Examples:
import ir_datasets
dataset = ir_datasets.load("dpr-w100/natural-questions/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>
You can find more details about the Python API here.
ir_datasets export dpr-w100/natural-questions/dev docs
[doc_id]    [text]    [title]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/natural-questions/dev')
# Index dpr-w100
indexer = pt.IterDictIndexer('./indices/dpr-w100')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
| Rel. | Definition | 
|---|---|
| -1 | negative samples | 
| 0 | "hard" negative samples | 
| 1 | contains the answer text and retrieved in the top BM25 results | 
| 2 | marked by human annotator as containing the answer | 
Examples:
import ir_datasets
dataset = ir_datasets.load("dpr-w100/natural-questions/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
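The four relevance levels in the table can be separated per query. The following is a minimal sketch using a stand-in `Qrel` namedtuple and made-up records; the short labels in `LEVEL_LABELS` are hypothetical names chosen for this example, not identifiers from ir_datasets.

```python
from collections import defaultdict, namedtuple

# Stand-in for the qrel namedtuple shown above; real records come from qrels_iter().
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])

# Hypothetical short labels for the four documented relevance levels.
LEVEL_LABELS = {
    -1: "negative",
    0: "hard_negative",
    1: "bm25_positive",
    2: "human_positive",
}


def group_by_level(qrels):
    """Map query_id -> label -> list of doc_ids, preserving input order."""
    grouped = defaultdict(lambda: defaultdict(list))
    for qrel in qrels:
        grouped[qrel.query_id][LEVEL_LABELS[qrel.relevance]].append(qrel.doc_id)
    # Return plain dicts so that looking up a missing key raises KeyError.
    return {qid: dict(levels) for qid, levels in grouped.items()}


# Made-up records for illustration only.
sample_qrels = [
    Qrel("q1", "d10", 2, "0"),
    Qrel("q1", "d11", 1, "0"),
    Qrel("q1", "d12", 0, "0"),
    Qrel("q1", "d13", -1, "0"),
    Qrel("q2", "d20", 1, "0"),
]
grouped = group_by_level(sample_qrels)
print(grouped["q1"]["hard_negative"])  # ['d12']
print(grouped["q2"])                   # {'bm25_positive': ['d20']}
```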
ir_datasets export dpr-w100/natural-questions/dev qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import MAP, nDCG  # import only the measures used below
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/natural-questions/dev')
index_ref = pt.IndexRef.of('./indices/dpr-w100') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Training subset from the Natural Questions Q&A collection. It differs from the natural-questions/train dataset in that it uses the full Wikipedia dump and applies additional filtering (described in the DPR paper).
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("dpr-w100/natural-questions/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, answers>
You can find more details about the Python API here.
ir_datasets export dpr-w100/natural-questions/train queries
[query_id]    [text]    [answers]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/natural-questions/train')
index_ref = pt.IndexRef.of('./indices/dpr-w100') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from dpr-w100
Examples:
import ir_datasets
dataset = ir_datasets.load("dpr-w100/natural-questions/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>
You can find more details about the Python API here.
ir_datasets export dpr-w100/natural-questions/train docs
[doc_id]    [text]    [title]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/natural-questions/train')
# Index dpr-w100
indexer = pt.IterDictIndexer('./indices/dpr-w100')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
| Rel. | Definition | 
|---|---|
| -1 | negative samples | 
| 0 | "hard" negative samples | 
| 1 | contains the answer text and retrieved in the top BM25 results | 
| 2 | marked by human annotator as containing the answer | 
Examples:
import ir_datasets
dataset = ir_datasets.load("dpr-w100/natural-questions/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
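Because the training qrels distinguish positives from "hard" negatives, one plausible use is building (query, positive, hard negative) triples for training a retriever. The sketch below assumes that levels 1 and 2 count as positives and level 0 as hard negatives; the function `training_triples` and the sample records are hypothetical, and this pairing scheme is not necessarily the one used in the DPR paper.

```python
from collections import namedtuple
from itertools import product

# Stand-in for the qrel namedtuple shown above; real records come from qrels_iter().
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def training_triples(qrels):
    """Yield (query_id, positive_doc_id, hard_negative_doc_id) triples.

    Assumption: relevance >= 1 is a positive, relevance == 0 is a hard negative,
    and relevance == -1 (ordinary negatives) is ignored.
    """
    positives, hard_negatives = {}, {}
    for qrel in qrels:
        if qrel.relevance >= 1:
            positives.setdefault(qrel.query_id, []).append(qrel.doc_id)
        elif qrel.relevance == 0:
            hard_negatives.setdefault(qrel.query_id, []).append(qrel.doc_id)
    for query_id, pos_ids in positives.items():
        neg_ids = hard_negatives.get(query_id, [])
        for pos_id, neg_id in product(pos_ids, neg_ids):
            yield (query_id, pos_id, neg_id)


# Made-up records for illustration only.
sample_qrels = [
    Qrel("q1", "p1", 2, "0"),
    Qrel("q1", "p2", 1, "0"),
    Qrel("q1", "n1", 0, "0"),
    Qrel("q1", "n2", -1, "0"),
    Qrel("q2", "p3", 1, "0"),  # no hard negative, so q2 yields no triple
]
triples = list(training_triples(sample_qrels))
print(triples)  # [('q1', 'p1', 'n1'), ('q1', 'p2', 'n1')]
```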
ir_datasets export dpr-w100/natural-questions/train qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import MAP, nDCG  # import only the measures used below
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/natural-questions/train')
index_ref = pt.IndexRef.of('./indices/dpr-w100') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Dev subset from the Trivia QA dataset. Unlike the official Trivia QA collection, this subset uses the DPR Wikipedia dump as its source collection. Refer to the DPR paper for more details.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("dpr-w100/trivia-qa/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, answers>
You can find more details about the Python API here.
ir_datasets export dpr-w100/trivia-qa/dev queries
[query_id]    [text]    [answers]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/trivia-qa/dev')
index_ref = pt.IndexRef.of('./indices/dpr-w100') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from dpr-w100
Examples:
import ir_datasets
dataset = ir_datasets.load("dpr-w100/trivia-qa/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>
You can find more details about the Python API here.
ir_datasets export dpr-w100/trivia-qa/dev docs
[doc_id]    [text]    [title]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/trivia-qa/dev')
# Index dpr-w100
indexer = pt.IterDictIndexer('./indices/dpr-w100')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
| Rel. | Definition | 
|---|---|
| -1 | negative samples | 
| 0 | "hard" negative samples | 
| 1 | contains the answer text and retrieved in the top BM25 results | 
| 2 | marked by human annotator as containing the answer | 
Examples:
import ir_datasets
dataset = ir_datasets.load("dpr-w100/trivia-qa/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export dpr-w100/trivia-qa/dev qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import MAP, nDCG  # import only the measures used below
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/trivia-qa/dev')
index_ref = pt.IndexRef.of('./indices/dpr-w100') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
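Since the qrels are graded from -1 to 2, any hand-rolled measure has to pick a cutoff for what counts as relevant. The sketch below computes a simple "success at k" score (whether at least one relevant passage appears in the top k) with a configurable `min_rel` threshold. The function names, the `run` structure, and the sample data are all hypothetical; this is not part of PyTerrier or ir_datasets.

```python
def success_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Return 1.0 if any of the top-k ranked doc_ids is relevant, else 0.0."""
    return 1.0 if any(d in relevant_doc_ids for d in ranked_doc_ids[:k]) else 0.0


def mean_success_at_k(run, qrels, k, min_rel=1):
    """Average success@k over all queries in `run`.

    run:     dict mapping query_id -> list of doc_ids, best first
    qrels:   iterable of (query_id, doc_id, relevance) tuples
    min_rel: smallest relevance level treated as relevant (assumed default: 1)
    """
    relevant = {}
    for query_id, doc_id, relevance in qrels:
        if relevance >= min_rel:
            relevant.setdefault(query_id, set()).add(doc_id)
    scores = [
        success_at_k(ranking, relevant.get(query_id, set()), k)
        for query_id, ranking in run.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up run and qrels for illustration only.
run = {"q1": ["d3", "d1"], "q2": ["d9", "d8"]}
qrels = [("q1", "d1", 1), ("q1", "d3", -1), ("q2", "d7", 2)]

print(mean_success_at_k(run, qrels, k=1))  # 0.0
print(mean_success_at_k(run, qrels, k=2))  # 0.5
```

Raising `min_rel` to 2 would count only passages marked by a human annotator as relevant.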
Training subset from the Trivia QA dataset. Unlike the official Trivia QA collection, this subset uses the DPR Wikipedia dump as its source collection. Refer to the DPR paper for more details.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("dpr-w100/trivia-qa/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, answers>
You can find more details about the Python API here.
ir_datasets export dpr-w100/trivia-qa/train queries
[query_id]    [text]    [answers]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/trivia-qa/train')
index_ref = pt.IndexRef.of('./indices/dpr-w100') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from dpr-w100
Examples:
import ir_datasets
dataset = ir_datasets.load("dpr-w100/trivia-qa/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>
You can find more details about the Python API here.
ir_datasets export dpr-w100/trivia-qa/train docs
[doc_id]    [text]    [title]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/trivia-qa/train')
# Index dpr-w100
indexer = pt.IterDictIndexer('./indices/dpr-w100')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])
You can find more details about PyTerrier indexing here.
Relevance levels
| Rel. | Definition | 
|---|---|
| -1 | negative samples | 
| 0 | "hard" negative samples | 
| 1 | contains the answer text and retrieved in the top BM25 results | 
| 2 | marked by human annotator as containing the answer | 
Examples:
import ir_datasets
dataset = ir_datasets.load("dpr-w100/trivia-qa/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export dpr-w100/trivia-qa/train qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import MAP, nDCG  # import only the measures used below
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/trivia-qa/train')
index_ref = pt.IndexRef.of('./indices/dpr-w100') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.