ir_datasets: MSMARCO (QnA)
msmarco-qna
The MS MARCO Question Answering dataset. This is the source collection of msmarco-passage and msmarco-document.
Query IDs in this collection align with those found in msmarco-passage and msmarco-document. The collection does not provide doc_ids, so these are assigned in the format [msmarco_passage_id]-[url_seq], where [msmarco_passage_id] is the document from msmarco-passage with matching contents and [url_seq] is assigned sequentially for each URL encountered. In other words, all documents with the same prefix have the same text; they differ only in the originating document.
Doc msmarco_passage_id fields are assigned by matching passage contents in msmarco-passage, and this field is provided for every document. Doc msmarco_document_id fields are assigned by matching the URL to the one found in msmarco-document. Due to how msmarco-document was constructed, there is not necessarily a match (the value will be None if there is no match).
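Since each doc_id is simply the two components joined by a hyphen, it can be split back apart. A minimal sketch (not part of the library API, assuming rsplit on the final hyphen is safe because both components are numeric):
import ir_datasets
dataset = ir_datasets.load('msmarco-qna')
doc = next(iter(dataset.docs_iter()))
# recover the components of the [msmarco_passage_id]-[url_seq] doc_id format
passage_id, url_seq = doc.doc_id.rsplit('-', 1)
assert passage_id == doc.msmarco_passage_id
# msmarco_document_id may be None when no URL match exists in msmarco-document
has_document_match = doc.msmarco_document_id is not None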
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, url, msmarco_passage_id, msmarco_document_id>
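For random access by doc_id (rather than a full scan), ir_datasets also exposes a document store. A minimal sketch; the doc_id shown is hypothetical and used only for illustration:
import ir_datasets
dataset = ir_datasets.load('msmarco-qna')
docs_store = dataset.docs_store()
doc = docs_store.get('0-0')  # hypothetical doc_id; returns the same namedtuple as docs_iter()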
msmarco-qna/dev
Official dev set.
The scoreddocs provide the roughly 10 passages presented to the user for annotation, where the score indicates the order in which they were presented.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/dev')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, type, answers>
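Because dev queries carry the human-written answers, a query_id-to-answers mapping for QA-style evaluation is straightforward. A minimal sketch (note this loads all dev queries into memory):
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/dev')
# map each query_id to its annotated answers
answers = {query.query_id: query.answers for query in dataset.queries_iter()}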
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/dev')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, url, msmarco_passage_id, msmarco_document_id>
Relevance levels
| Rel. | Definition |
|---|---|
| 0 | Not marked by annotator as a contribution to their answer |
| 1 | Marked by annotator as a contribution to their answer |
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/dev')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
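A minimal sketch for collecting, per query, the doc_ids the annotator marked as contributing to their answer (relevance 1):
from collections import defaultdict
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/dev')
relevant = defaultdict(set)
for qrel in dataset.qrels_iter():
    if qrel.relevance == 1:  # marked as a contribution to the answer
        relevant[qrel.query_id].add(qrel.doc_id)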
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/dev')
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
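Since the score encodes presentation order, the passages shown for each query can be reconstructed by grouping and sorting. A minimal sketch; whether higher scores correspond to earlier positions is an assumption here, so flip the sort if the opposite holds:
from collections import defaultdict
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/dev')
presented = defaultdict(list)
for scoreddoc in dataset.scoreddocs_iter():
    presented[scoreddoc.query_id].append((scoreddoc.score, scoreddoc.doc_id))
for query_id in presented:
    # assumes higher score = presented earlier; adjust reverse= if needed
    presented[query_id].sort(reverse=True)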
msmarco-qna/eval
Official eval set.
The scoreddocs provide the roughly 10 passages presented to the user for annotation, where the score indicates the order in which they were presented.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/eval')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, type>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/eval')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, url, msmarco_passage_id, msmarco_document_id>
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/eval')
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
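The eval split ships no qrels, but the presented passages can still be materialized by joining scoreddocs against the document store. A minimal sketch, assuming docs_store lookups as above:
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/eval')
docs_store = dataset.docs_store()
for scoreddoc in dataset.scoreddocs_iter():
    doc = docs_store.get(scoreddoc.doc_id)
    # doc.text is one of the passages presented for scoreddoc.query_id
    break  # inspect only the first entry here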
msmarco-qna/train
Official train set.
The scoreddocs provide the roughly 10 passages presented to the user for annotation, where the score indicates the order in which they were presented.
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/train')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, type, answers>
Language: en
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/train')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, url, msmarco_passage_id, msmarco_document_id>
Relevance levels
| Rel. | Definition |
|---|---|
| 0 | Not marked by annotator as a contribution to their answer |
| 1 | Marked by annotator as a contribution to their answer |
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/train')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
Example
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/train')
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
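Because qrels and scoreddocs share query_ids, the train split supports building (query, positives, negatives) examples, with positives being the presented passages the annotator marked and negatives the rest. A minimal sketch (not an official recipe; loads all train queries into memory):
from collections import defaultdict
import ir_datasets
dataset = ir_datasets.load('msmarco-qna/train')
queries = {query.query_id: query.text for query in dataset.queries_iter()}
positive = defaultdict(set)
for qrel in dataset.qrels_iter():
    if qrel.relevance == 1:  # marked as a contribution to the answer
        positive[qrel.query_id].add(qrel.doc_id)
presented = defaultdict(list)
for scoreddoc in dataset.scoreddocs_iter():
    presented[scoreddoc.query_id].append(scoreddoc.doc_id)
triples = []
for query_id, doc_ids in presented.items():
    pos = [d for d in doc_ids if d in positive[query_id]]
    neg = [d for d in doc_ids if d not in positive[query_id]]
    if pos and neg:
        triples.append((queries[query_id], pos, neg))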