Github: datasets/dpr_w100.py

ir_datasets: DPR Wiki100

Index
  1. dpr-w100
  2. dpr-w100/natural-questions/dev
  3. dpr-w100/natural-questions/train
  4. dpr-w100/trivia-qa/dev
  5. dpr-w100/trivia-qa/train

"dpr-w100"

A Wikipedia dump from 20 December 2018, split into passages of 100 words. Used in the DPR paper (and in subsequent works) for retrieval experiments over Q&A collections.

Provides: docs

Language: en

Document type:
DprW100Doc: (namedtuple)
  1. doc_id: str
  2. text: str
  3. title: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("dpr-w100")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, title>

You can find more details about the Python API here.
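
Since the corpus is described as 100-word passages, a quick sanity check is to look at the first few documents and count their words. The following is a minimal sketch: `Doc` is a hypothetical stand-in with the same fields as DprW100Doc, and the sample rows are made up so the snippet runs without downloading anything. `preview_docs` is an illustrative helper, not part of ir_datasets.

```python
from collections import namedtuple
from itertools import islice

# Hypothetical stand-in mirroring the fields of DprW100Doc (doc_id, text, title).
Doc = namedtuple("Doc", ["doc_id", "text", "title"])


def preview_docs(docs, n=3):
    """Return (doc_id, title, word_count) for the first n documents."""
    return [(d.doc_id, d.title, len(d.text.split())) for d in islice(docs, n)]


# Made-up rows for illustration only; real passages are about 100 words long.
sample_docs = [
    Doc("1", "first toy passage text", "Toy title A"),
    Doc("2", "second toy passage", "Toy title B"),
]

preview = preview_docs(sample_docs, n=1)
```

With the real corpus, the same helper can be given `ir_datasets.load("dpr-w100").docs_iter()` in place of `sample_docs`; `islice` stops after `n` items, so the full collection is not iterated.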


"dpr-w100/natural-questions/dev"

Dev subset from the Natural Questions Q&A collection. This differs from the natural-questions/dev dataset in that it uses the full Wikipedia dump and applies additional filtering (described in the DPR paper).

Provides: queries, docs, qrels

Language: en

Query type:
DprW100Query: (namedtuple)
  1. query_id: str
  2. text: str
  3. answers: Tuple[str, ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("dpr-w100/natural-questions/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, answers>
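
The `answers` field holds the acceptable answer strings for a question, which makes it possible to check whether a passage contains an answer. The sketch below is a simplified illustration: it uses a case-insensitive substring test, which is an assumption and may differ from the matching procedure in the DPR paper. `Query` is a hypothetical stand-in for DprW100Query, `passage_has_answer` is an illustrative helper, and the data is made up.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the fields of DprW100Query.
Query = namedtuple("Query", ["query_id", "text", "answers"])


def passage_has_answer(passage_text, answers):
    """Return True if any non-empty answer occurs in the passage, ignoring case."""
    haystack = passage_text.lower()
    return any(answer.lower() in haystack for answer in answers if answer)


# Made-up example data for illustration only.
query = Query("q1", "who wrote hamlet", ("William Shakespeare", "Shakespeare"))
matching = passage_has_answer("Hamlet is a tragedy by William Shakespeare.", query.answers)
non_matching = passage_has_answer("A passage about something else.", query.answers)
```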



"dpr-w100/natural-questions/train"

Training subset from the Natural Questions Q&A collection. This differs from the natural-questions/train dataset in that it uses the full Wikipedia dump and applies additional filtering (described in the DPR paper).

Provides: queries, docs, qrels

Language: en

Query type:
DprW100Query: (namedtuple)
  1. query_id: str
  2. text: str
  3. answers: Tuple[str, ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("dpr-w100/natural-questions/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, answers>
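
This subset also provides qrels, which can be grouped by query when preparing training pairs. The sketch below rests on assumptions: the qrel shape (`query_id`, `doc_id`, `relevance`) follows the common ir_datasets layout but is not listed on this page, and the `min_relevance` threshold is a placeholder to be set after checking the dataset's own relevance definitions. `Qrel` and `positives_by_query` are hypothetical names, and the rows are made up.

```python
from collections import defaultdict, namedtuple

# Hypothetical qrel shape; the real field names should be confirmed on the dataset.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance"])


def positives_by_query(qrels, min_relevance=1):
    """Map each query_id to the doc_ids whose relevance is at least min_relevance."""
    grouped = defaultdict(list)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            grouped[qrel.query_id].append(qrel.doc_id)
    return dict(grouped)


# Made-up rows for illustration only.
sample_qrels = [
    Qrel("q1", "d10", 1),
    Qrel("q1", "d11", 0),
    Qrel("q2", "d20", 2),
    Qrel("q2", "d21", 1),
]

positives = positives_by_query(sample_qrels)
```

With the real data, the same helper could be given `dataset.qrels_iter()` in place of `sample_qrels`.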



"dpr-w100/trivia-qa/dev"

Dev subset from the Trivia QA dataset. Unlike the official Trivia QA collection, this subset uses the DPR Wikipedia dump as its source collection. Refer to the DPR paper for more details.

Provides: queries, docs, qrels

Language: en

Query type:
DprW100Query: (namedtuple)
  1. query_id: str
  2. text: str
  3. answers: Tuple[str, ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("dpr-w100/trivia-qa/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, answers>
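
A dev subset like this one is typically used to score a retriever, for example by asking whether any of the top-k retrieved passages contains an answer string. The sketch below shows one way this could be computed. The case-insensitive substring match is a simplifying assumption rather than the DPR paper's exact procedure, `hit_at_k` and `top_k_accuracy` are hypothetical helper names, and the rankings are made up.

```python
def hit_at_k(ranked_passages, answers, k):
    """Return True if any of the top-k passage texts contains an answer string."""
    lowered = [answer.lower() for answer in answers if answer]
    return any(
        any(answer in text.lower() for answer in lowered)
        for text in ranked_passages[:k]
    )


def top_k_accuracy(runs, k):
    """runs: list of (ranked_passage_texts, answers) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(hit_at_k(passages, answers, k) for passages, answers in runs) / len(runs)


# Made-up rankings for illustration only.
runs = [
    (["the capital is paris", "nothing relevant"], ("Paris",)),
    (["unrelated text", "berlin is mentioned here"], ("Berlin", "Berlin, Germany")),
]

accuracy_at_1 = top_k_accuracy(runs, k=1)
accuracy_at_2 = top_k_accuracy(runs, k=2)
```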



"dpr-w100/trivia-qa/train"

Training subset from the Trivia QA dataset. Unlike the official Trivia QA collection, this subset uses the DPR Wikipedia dump as its source collection. Refer to the DPR paper for more details.

Provides: queries, docs, qrels

Language: en

Query type:
DprW100Query: (namedtuple)
  1. query_id: str
  2. text: str
  3. answers: Tuple[str, ...]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("dpr-w100/trivia-qa/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, answers>
