ir_datasets: MSMARCO (QnA)
The MS MARCO Question Answering dataset. This is the source collection of msmarco-passage and msmarco-document.
Query IDs in this collection align with those found in msmarco-passage and msmarco-document, but document IDs do not (the QnA collection does not come with document IDs itself; these are assigned sequentially by ir_datasets).
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, url>
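Because the query IDs here align with msmarco-passage and msmarco-document, records can be joined across collections by query_id. A minimal sketch of that join against msmarco-passage/dev (field names follow the namedtuples shown on this page):
import ir_datasets
qna = ir_datasets.load('msmarco-qna/dev')
passage = ir_datasets.load('msmarco-passage/dev')
# Index the QnA queries by ID, then pick out the passage-ranking
# queries that also appear in the QnA collection.
qna_queries = {q.query_id: q for q in qna.queries_iter()}
for query in passage.queries_iter():
    if query.query_id in qna_queries:
        # same question in both collections; only doc IDs differ
        print(query.query_id, qna_queries[query.query_id].text)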
msmarco-qna/dev
Official dev set.
The scoreddocs provide the roughly 10 passages presented to the user for annotation; the score indicates the order in which they were presented.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/dev')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, type, answers>
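Because dev queries carry reference answers, they can serve directly as QA ground truth. A small sketch keying the answers by query ID (assuming the answers field holds a sequence of answer strings, as the name suggests):
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/dev')
# Collect the reference answers per query; queries without an
# answer are assumed to yield an empty sequence.
answers_by_qid = {q.query_id: list(q.answers) for q in dataset.queries_iter()}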
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/dev')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, url>
Relevance levels
Rel. | Definition |
---|---|
0 | Not marked by the annotator as contributing to their answer |
1 | Marked by the annotator as contributing to their answer |
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/dev')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
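A common use of these qrels is to pull out, per query, the passages the annotator actually drew on, then resolve their text through the dataset's docs store (docs_store() gives random-access lookup by doc_id):
import ir_datasets
from collections import defaultdict
dataset = ir_datasets.load('msmarco-qna/dev')
docstore = dataset.docs_store()  # random-access lookup by doc_id
# relevance 1 = the annotator marked the passage as contributing
contributing = defaultdict(list)
for qrel in dataset.qrels_iter():
    if qrel.relevance == 1:
        contributing[qrel.query_id].append(qrel.doc_id)
# Fetch the text of the supporting passages for one query.
qid, doc_ids = next(iter(contributing.items()))
for doc_id in doc_ids:
    print(docstore.get(doc_id).text)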
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/dev')
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
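Since the score encodes presentation order rather than retrieval quality, sorting a query's scoreddocs by score reconstructs the list roughly as the annotator saw it. A sketch; whether a higher score means an earlier position is an assumption worth verifying on the data:
import ir_datasets
from collections import defaultdict
dataset = ir_datasets.load('msmarco-qna/dev')
# Group the roughly 10 presented passages by query.
presented = defaultdict(list)
for sdoc in dataset.scoreddocs_iter():
    presented[sdoc.query_id].append(sdoc)
qid, sdocs = next(iter(presented.items()))
# Assumption: higher score = shown earlier; flip reverse= if the
# convention turns out to be the opposite.
for sdoc in sorted(sdocs, key=lambda d: d.score, reverse=True):
    print(sdoc.doc_id, sdoc.score)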
msmarco-qna/eval
Official eval set.
The scoreddocs provide the roughly 10 passages presented to the user for annotation; the score indicates the order in which they were presented.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/eval')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, type>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/eval')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, url>
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/eval')
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
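With no answers or qrels on this split, the scoreddocs act as the candidate pool for systems to rerank. A sketch that writes reranked candidates as a TREC-style run file; my_score is a hypothetical stand-in for a real model:
import ir_datasets
from collections import defaultdict
dataset = ir_datasets.load('msmarco-qna/eval')
def my_score(query_id, doc_id):
    return 0.0  # hypothetical reranker; plug in a real model
candidates = defaultdict(list)
for sdoc in dataset.scoreddocs_iter():
    candidates[sdoc.query_id].append(sdoc.doc_id)
with open('msmarco-qna-eval.run', 'w') as f:
    for qid, doc_ids in candidates.items():
        scored = sorted(((my_score(qid, d), d) for d in doc_ids), reverse=True)
        for rank, (score, doc_id) in enumerate(scored):
            f.write(f'{qid} Q0 {doc_id} {rank} {score} my_run\n')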
msmarco-qna/train
Official train set.
The scoreddocs provide the roughly 10 passages presented to the user for annotation; the score indicates the order in which they were presented.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/train')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, type, answers>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/train')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, url>
Relevance levels
Rel. | Definition |
---|---|
0 | Not marked by the annotator as contributing to their answer |
1 | Marked by the annotator as contributing to their answer |
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/train')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/train')
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
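Combining the train qrels with the scoreddocs yields labeled training examples: passages marked relevance 1 are positives, and the remaining passages presented for the same query make natural negatives. A sketch under that assumption:
import ir_datasets
from collections import defaultdict
dataset = ir_datasets.load('msmarco-qna/train')
# relevance 1 = passage contributed to the annotator's answer
positive = defaultdict(set)
for qrel in dataset.qrels_iter():
    if qrel.relevance == 1:
        positive[qrel.query_id].add(qrel.doc_id)
# Label every presented passage: 1 if it contributed, else 0.
examples = []  # (query_id, doc_id, label)
for sdoc in dataset.scoreddocs_iter():
    label = 1 if sdoc.doc_id in positive[sdoc.query_id] else 0
    examples.append((sdoc.query_id, sdoc.doc_id, label))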