ir_datasets : DPR Wiki100

import ir_datasets
dataset = ir_datasets.load("dpr-w100")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>

You can find more details about the Python API here.

CLI

ir_datasets export dpr-w100 docs



[doc_id]    [text]    [title]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:dpr-w100')
# Index dpr-w100
indexer = pt.IterDictIndexer('./indices/dpr-w100')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

\cite{Karpukhin2020Dpr}

Bibtex:

@misc{Karpukhin2020Dpr, title={Dense Passage Retrieval for Open-Domain Question Answering}, author={Vladimir Karpukhin and Barlas Oğuz and Sewon Min and Patrick Lewis and Ledell Wu and Sergey Edunov and Danqi Chen and Wen-tau Yih}, year={2020}, eprint={2004.04906}, archivePrefix={arXiv}, primaryClass={cs.CL} }

{
  "docs": {
    "count": 21015324,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  }
}

`"dpr-w100/natural-questions/dev"`

Dev subset from the Natural Questions Q&A collection. This differs from the natural-questions/dev dataset in that it uses the full Wikipedia dump and additional filtering (described in the DPR paper) was applied.

See also: natural-questions

6.5K queries

Language: en

Query type:

DprW100Query: (namedtuple)

query_id: str
text: str
answers: Tuple[str]

Examples:

import ir_datasets
dataset = ir_datasets.load("dpr-w100/natural-questions/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, answers>

You can find more details about the Python API here.

CLI

ir_datasets export dpr-w100/natural-questions/dev queries



[query_id]    [text]    [answers]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/natural-questions/dev')
index_ref = pt.IndexRef.of('./indices/dpr-w100') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

21M docs

Inherits docs from dpr-w100

Language: en

Document type:

DprW100Doc: (namedtuple)

doc_id: str
text: str
title: str

Examples:

import ir_datasets
dataset = ir_datasets.load("dpr-w100/natural-questions/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>

You can find more details about the Python API here.

CLI

ir_datasets export dpr-w100/natural-questions/dev docs



[doc_id]    [text]    [title]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/natural-questions/dev')
# Index dpr-w100
indexer = pt.IterDictIndexer('./indices/dpr-w100')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels

980K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
-1	negative samples	`326K`	33.2%
0	"hard" negative samples	`603K`	61.5%
1	contains the answer text and retrieved in the top BM25 results	`45K`	4.6%
2	marked by human annotator as containing the answer	`6.5K`	0.7%

Examples:

import ir_datasets
dataset = ir_datasets.load("dpr-w100/natural-questions/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export dpr-w100/natural-questions/dev qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/natural-questions/dev')
index_ref = pt.IndexRef.of('./indices/dpr-w100') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

\cite{Kwiatkowski2019Nq,Karpukhin2020Dpr}

Bibtex:

@article{Kwiatkowski2019Nq, title = {Natural Questions: a Benchmark for Question Answering Research}, author = {Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Matthew Kelcey and Jacob Devlin and Kenton Lee and Kristina N. Toutanova and Llion Jones and Ming-Wei Chang and Andrew Dai and Jakob Uszkoreit and Quoc Le and Slav Petrov}, year = {2019}, journal = {TACL} } @misc{Karpukhin2020Dpr, title={Dense Passage Retrieval for Open-Domain Question Answering}, author={Vladimir Karpukhin and Barlas Oğuz and Sewon Min and Patrick Lewis and Ledell Wu and Sergey Edunov and Danqi Chen and Wen-tau Yih}, year={2020}, eprint={2004.04906}, archivePrefix={arXiv}, primaryClass={cs.CL} }

{
  "docs": {
    "count": 21015324,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 6515
  },
  "qrels": {
    "count": 979893,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 6515,
          "1": 44736,
          "0": 602894,
          "-1": 325748
        }
      }
    }
  }
}

`"dpr-w100/natural-questions/train"`

Training subset from the Natural Questions Q&A collection. This differs from the natural-questions/train dataset in that it uses the full Wikipedia dump and additional filtering (described in the DPR paper) was applied.

See also: natural-questions

59K queries

Language: en

Query type:

DprW100Query: (namedtuple)

query_id: str
text: str
answers: Tuple[str]

Examples:

import ir_datasets
dataset = ir_datasets.load("dpr-w100/natural-questions/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, answers>

You can find more details about the Python API here.

CLI

ir_datasets export dpr-w100/natural-questions/train queries



[query_id]    [text]    [answers]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/natural-questions/train')
index_ref = pt.IndexRef.of('./indices/dpr-w100') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

21M docs

Inherits docs from dpr-w100

Language: en

Document type:

DprW100Doc: (namedtuple)

doc_id: str
text: str
title: str

Examples:

import ir_datasets
dataset = ir_datasets.load("dpr-w100/natural-questions/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>

You can find more details about the Python API here.

CLI

ir_datasets export dpr-w100/natural-questions/train docs



[doc_id]    [text]    [title]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/natural-questions/train')
# Index dpr-w100
indexer = pt.IterDictIndexer('./indices/dpr-w100')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels

8.9M qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
-1	negative samples	`2.9M`	33.2%
0	"hard" negative samples	`5.4M`	61.5%
1	contains the answer text and retrieved in the top BM25 results	`406K`	4.6%
2	marked by human annotator as containing the answer	`59K`	0.7%

Examples:

import ir_datasets
dataset = ir_datasets.load("dpr-w100/natural-questions/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export dpr-w100/natural-questions/train qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/natural-questions/train')
index_ref = pt.IndexRef.of('./indices/dpr-w100') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

\cite{Kwiatkowski2019Nq,Karpukhin2020Dpr}

Bibtex:

@article{Kwiatkowski2019Nq, title = {Natural Questions: a Benchmark for Question Answering Research}, author = {Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Matthew Kelcey and Jacob Devlin and Kenton Lee and Kristina N. Toutanova and Llion Jones and Ming-Wei Chang and Andrew Dai and Jakob Uszkoreit and Quoc Le and Slav Petrov}, year = {2019}, journal = {TACL} } @misc{Karpukhin2020Dpr, title={Dense Passage Retrieval for Open-Domain Question Answering}, author={Vladimir Karpukhin and Barlas Oğuz and Sewon Min and Patrick Lewis and Ledell Wu and Sergey Edunov and Danqi Chen and Wen-tau Yih}, year={2020}, eprint={2004.04906}, archivePrefix={arXiv}, primaryClass={cs.CL} }

{
  "docs": {
    "count": 21015324,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 58880
  },
  "qrels": {
    "count": 8856662,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 58880,
          "1": 405729,
          "0": 5448064,
          "-1": 2943989
        }
      }
    }
  }
}

`"dpr-w100/trivia-qa/dev"`

Dev subset from the Trivia QA dataset. Differing from the official Trivia QA collection, this uses the DPR Wikipedia dump as the source collection. Refer to the DPR paper for more details.

8.8K queries

Language: en

Query type:

DprW100Query: (namedtuple)

query_id: str
text: str
answers: Tuple[str]

Examples:

import ir_datasets
dataset = ir_datasets.load("dpr-w100/trivia-qa/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, answers>

You can find more details about the Python API here.

CLI

ir_datasets export dpr-w100/trivia-qa/dev queries



[query_id]    [text]    [answers]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/trivia-qa/dev')
index_ref = pt.IndexRef.of('./indices/dpr-w100') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

21M docs

Inherits docs from dpr-w100

Language: en

Document type:

DprW100Doc: (namedtuple)

doc_id: str
text: str
title: str

Examples:

import ir_datasets
dataset = ir_datasets.load("dpr-w100/trivia-qa/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>

You can find more details about the Python API here.

CLI

ir_datasets export dpr-w100/trivia-qa/dev docs



[doc_id]    [text]    [title]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/trivia-qa/dev')
# Index dpr-w100
indexer = pt.IterDictIndexer('./indices/dpr-w100')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels

884K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
-1	negative samples	`0`	0.0%
0	"hard" negative samples	`801K`	90.6%
1	contains the answer text and retrieved in the top BM25 results	`83K`	9.4%
2	marked by human annotator as containing the answer	`0`	0.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("dpr-w100/trivia-qa/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export dpr-w100/trivia-qa/dev qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/trivia-qa/dev')
index_ref = pt.IndexRef.of('./indices/dpr-w100') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

\cite{Joshi2017TriviaQA,Karpukhin2020Dpr}

Bibtex:

@inproceedings{Joshi2017TriviaQA, title={TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension}, author={Mandar Joshi and Eunsol Choi and Daniel S. Weld and Luke Zettlemoyer}, booktitle={ACL}, year={2017} } @misc{Karpukhin2020Dpr, title={Dense Passage Retrieval for Open-Domain Question Answering}, author={Vladimir Karpukhin and Barlas Oğuz and Sewon Min and Patrick Lewis and Ledell Wu and Sergey Edunov and Danqi Chen and Wen-tau Yih}, year={2020}, eprint={2004.04906}, archivePrefix={arXiv}, primaryClass={cs.CL} }

{
  "docs": {
    "count": 21015324,
    "fields": {
      "doc_id": {
        "max_len": 8,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 8837
  },
  "qrels": {
    "count": 883700,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 82658,
          "0": 801042
        }
      }
    }
  }
}

`"dpr-w100/trivia-qa/train"`

Training subset from the Trivia QA dataset. Differing from the official Trivia QA collection, this uses the DPR Wikipedia dump as the source collection. Refer to the DPR paper for more details.

79K queries

Language: en

Query type:

DprW100Query: (namedtuple)

query_id: str
text: str
answers: Tuple[str]

Examples:

import ir_datasets
dataset = ir_datasets.load("dpr-w100/trivia-qa/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, answers>

You can find more details about the Python API here.

CLI

ir_datasets export dpr-w100/trivia-qa/train queries



[query_id]    [text]    [answers]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/trivia-qa/train')
index_ref = pt.IndexRef.of('./indices/dpr-w100') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))

You can find more details about PyTerrier retrieval here.

docs

21M docs

Inherits docs from dpr-w100

Language: en

Document type:

DprW100Doc: (namedtuple)

doc_id: str
text: str
title: str

Examples:

import ir_datasets
dataset = ir_datasets.load("dpr-w100/trivia-qa/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>

You can find more details about the Python API here.

CLI

ir_datasets export dpr-w100/trivia-qa/train docs



[doc_id]    [text]    [title]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/trivia-qa/train')
# Index dpr-w100
indexer = pt.IterDictIndexer('./indices/dpr-w100')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title'])

You can find more details about PyTerrier indexing here.

qrels

7.9M qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
-1	negative samples	`0`	0.0%
0	"hard" negative samples	`7.1M`	90.6%
1	contains the answer text and retrieved in the top BM25 results	`741K`	9.4%
2	marked by human annotator as containing the answer	`0`	0.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("dpr-w100/trivia-qa/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export dpr-w100/trivia-qa/train qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:dpr-w100/trivia-qa/train')
index_ref = pt.IndexRef.of('./indices/dpr-w100') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

\cite{Joshi2017TriviaQA,Karpukhin2020Dpr}

Bibtex:

@inproceedings{Joshi2017TriviaQA, title={TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension}, author={Mandar Joshi and Eunsol Choi and Daniel S. Weld and Luke Zettlemoyer}, booktitle={ACL}, year={2017} } @misc{Karpukhin2020Dpr, title={Dense Passage Retrieval for Open-Domain Question Answering}, author={Vladimir Karpukhin and Barlas Oğuz and Sewon Min and Patrick Lewis and Ledell Wu and Sergey Edunov and Danqi Chen and Wen-tau Yih}, year={2020}, eprint={2004.04906}, archivePrefix={arXiv}, primaryClass={cs.CL} }