`ir_datasets`: MSMARCO (QnA)

Index

msmarco-qna
msmarco-qna/dev
msmarco-qna/eval
msmarco-qna/train

`"msmarco-qna"`

The MS MARCO Question Answering dataset. This is the source collection of msmarco-passage and msmarco-document.

It is prohibited to use information from this dataset for submissions to the MS MARCO passage and document leaderboards or the TREC DL shared task.

Query IDs in this collection align with those found in msmarco-passage and msmarco-document. The collection does not provide doc_ids, so these are assigned in the following format: [msmarco_passage_id]-[url_seq], where [msmarco_passage_id] is the ID of the passage in msmarco-passage that has matching contents and [url_seq] is assigned sequentially for each URL encountered. In other words, all documents with the same prefix have the same text; they differ only in the originating document.
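The doc_id format described above can be taken apart by splitting on the last hyphen. The following is a minimal sketch, not part of ir_datasets: the helper names `split_doc_id` and `group_by_passage` and the sample IDs are hypothetical, and it assumes that [url_seq] itself never contains a hyphen.

```python
from collections import defaultdict


def split_doc_id(doc_id):
    """Split an msmarco-qna doc_id into (msmarco_passage_id, url_seq).

    Assumes the documented "[msmarco_passage_id]-[url_seq]" layout and that
    url_seq contains no hyphen, so splitting on the last hyphen is safe.
    """
    passage_id, url_seq = doc_id.rsplit("-", 1)
    return passage_id, url_seq


def group_by_passage(doc_ids):
    """Group doc_ids that share a passage prefix (and therefore the same text)."""
    groups = defaultdict(list)
    for doc_id in doc_ids:
        passage_id, _ = split_doc_id(doc_id)
        groups[passage_id].append(doc_id)
    return dict(groups)


# Hypothetical IDs for illustration only; real IDs come from dataset.docs_iter().
sample_ids = ["100-0", "100-1", "205-0"]
groups = group_by_passage(sample_ids)
```

With the real collection, the same helpers could be fed `doc.doc_id` values from `dataset.docs_iter()`.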

Doc msmarco_passage_id fields are assigned by matching passage contents in msmarco-passage, and this field is provided for every document. Doc msmarco_document_id fields are assigned by matching the URL to the one found in msmarco-document. Due to how msmarco-document was constructed, there is not necessarily a match; the value is None when no match is found.
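Because msmarco_document_id may be None, code that joins against msmarco-document probably needs to skip unmatched records. A minimal sketch follows; the `Doc` namedtuple is a local stand-in that mirrors the documented MsMarcoQnADoc fields so the example runs without downloading anything, and the helper name and sample values are hypothetical.

```python
from collections import namedtuple

# Local stand-in mirroring the documented MsMarcoQnADoc fields.
Doc = namedtuple(
    "Doc",
    ["doc_id", "text", "url", "msmarco_passage_id", "msmarco_document_id"],
)


def docs_with_document_match(docs):
    """Yield only docs whose URL was matched to an msmarco-document entry."""
    for doc in docs:
        if doc.msmarco_document_id is not None:
            yield doc


# Hypothetical records; all values are made up for illustration.
sample_docs = [
    Doc("100-0", "some passage text", "http://example.com/a", "100", "D1"),
    Doc("100-1", "some passage text", "http://example.com/b", "100", None),
]
matched = list(docs_with_document_match(sample_docs))
```

With the real collection, `dataset.docs_iter()` could be passed in place of `sample_docs`.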

Documents: Short passages (from web)
Queries: Natural language questions (from query log), including type and natural-language answers.
Leaderboard
Dataset Paper
More information

docs

Language: en

Document type:

MsMarcoQnADoc: (namedtuple)

doc_id: str
text: str
url: str
msmarco_passage_id: str
msmarco_document_id: str

Example


import ir_datasets
dataset = ir_datasets.load('msmarco-qna')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, url, msmarco_passage_id, msmarco_document_id>

Citation

bibtex:
@inproceedings{Bajaj2016MSMA,
  title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset},
  author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang},
  booktitle={InCoCo@NIPS},
  year={2016}
}

`"msmarco-qna/dev"`

Official dev set.

The scoreddocs provide the roughly 10 passages presented to the user for annotation, where the score indicates the order in which they were presented.
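To recover the per-query list of presented passages, the scoreddocs can be grouped by query and sorted by score. This is a sketch under stated assumptions: the text only says that the score indicates presentation order, so whether larger scores come first is an assumption exposed here as the hypothetical `descending` parameter. `ScoredDoc` is a local stand-in for the documented GenericScoredDoc fields, and the sample values are made up.

```python
from collections import defaultdict, namedtuple

# Local stand-in mirroring the documented GenericScoredDoc fields.
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])


def passages_by_query(scoreddocs, descending=True):
    """Map each query_id to its doc_ids, ordered by score.

    The sort direction is an assumption; pass descending=False if inspection
    of the data shows that smaller scores were presented first.
    """
    by_query = defaultdict(list)
    for sd in scoreddocs:
        by_query[sd.query_id].append(sd)
    return {
        qid: [sd.doc_id for sd in sorted(sds, key=lambda sd: sd.score, reverse=descending)]
        for qid, sds in by_query.items()
    }


# Hypothetical records for illustration only.
sample = [
    ScoredDoc("q1", "100-0", 1.0),
    ScoredDoc("q1", "205-0", 2.0),
    ScoredDoc("q2", "300-0", 1.0),
]
ranked = passages_by_query(sample)
```

With the real collection, `dataset.scoreddocs_iter()` could be passed in place of `sample`.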

queries

Language: en

Query type:

MsMarcoQnAQuery: (namedtuple)

query_id: str
text: str
type: str
answers: Tuple[str, ...]

Example


import ir_datasets
dataset = ir_datasets.load('msmarco-qna/dev')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, type, answers>

docs

Language: en

Document type:

MsMarcoQnADoc: (namedtuple)

doc_id: str
text: str
url: str
msmarco_passage_id: str
msmarco_document_id: str

Example


import ir_datasets
dataset = ir_datasets.load('msmarco-qna/dev')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, url, msmarco_passage_id, msmarco_document_id>

qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition
0	Not marked by annotator as a contribution to their answer
1	Marked by annotator as a contribution to their answer
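Given the two relevance levels above, the passages an annotator marked as contributing to an answer can be collected per query. A minimal sketch follows; `Qrel` is a local stand-in for the documented TrecQrel fields, and the helper name, the `min_relevance` parameter, and the sample records are hypothetical.

```python
from collections import defaultdict, namedtuple

# Local stand-in mirroring the documented TrecQrel fields.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def contributing_passages(qrels, min_relevance=1):
    """Map query_id to the set of doc_ids judged at or above min_relevance.

    Queries with no qualifying passage are absent from the result.
    """
    selected = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            selected[qrel.query_id].add(qrel.doc_id)
    return dict(selected)


# Hypothetical judgments for illustration only.
sample_qrels = [
    Qrel("q1", "100-0", 1, "0"),
    Qrel("q1", "205-0", 0, "0"),
    Qrel("q2", "300-0", 0, "0"),
]
selected = contributing_passages(sample_qrels)
```

With the real collection, `dataset.qrels_iter()` could be passed in place of `sample_qrels`.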

Example


import ir_datasets
dataset = ir_datasets.load('msmarco-qna/dev')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Example


import ir_datasets
dataset = ir_datasets.load('msmarco-qna/dev')
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

`"msmarco-qna/eval"`

Official eval set.

The scoreddocs provide the roughly 10 passages presented to the user for annotation, where the score indicates the order in which they were presented.

queries

Language: en

Query type:

MsMarcoQnAEvalQuery: (namedtuple)

query_id: str
text: str
type: str

Example


import ir_datasets
dataset = ir_datasets.load('msmarco-qna/eval')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, type>
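Eval queries have no answers field, unlike the dev and train queries. Code that handles all three subsets may therefore want a uniform accessor. The following is one possible sketch: the two namedtuples are local stand-ins for the documented MsMarcoQnAQuery and MsMarcoQnAEvalQuery fields, and `answers_of` and the sample values are hypothetical.

```python
from collections import namedtuple

# Local stand-ins mirroring the documented query types.
Query = namedtuple("Query", ["query_id", "text", "type", "answers"])
EvalQuery = namedtuple("EvalQuery", ["query_id", "text", "type"])


def answers_of(query):
    """Return the query's answers, or an empty tuple when the field is absent."""
    return getattr(query, "answers", ())


# Hypothetical records for illustration only.
train_like = Query("q1", "example question", "example-type", ("example answer",))
eval_like = EvalQuery("q2", "another question", "example-type")
```

Returning an empty tuple keeps the return type consistent with the documented `Tuple[str, ...]`, so callers can iterate without a separate check.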

docs

Language: en

Document type:

MsMarcoQnADoc: (namedtuple)

doc_id: str
text: str
url: str
msmarco_passage_id: str
msmarco_document_id: str

Example


import ir_datasets
dataset = ir_datasets.load('msmarco-qna/eval')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, url, msmarco_passage_id, msmarco_document_id>

scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Example


import ir_datasets
dataset = ir_datasets.load('msmarco-qna/eval')
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

`"msmarco-qna/train"`

Official train set.

The scoreddocs provide the roughly 10 passages presented to the user for annotation, where the score indicates the order in which they were presented.

queries

Language: en

Query type:

MsMarcoQnAQuery: (namedtuple)

query_id: str
text: str
type: str
answers: Tuple[str, ...]

Example


import ir_datasets
dataset = ir_datasets.load('msmarco-qna/train')
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, type, answers>

docs

Language: en

Document type:

MsMarcoQnADoc: (namedtuple)

doc_id: str
text: str
url: str
msmarco_passage_id: str
msmarco_document_id: str

Example


import ir_datasets
dataset = ir_datasets.load('msmarco-qna/train')
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, url, msmarco_passage_id, msmarco_document_id>

qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition
0	Not marked by annotator as a contribution to their answer
1	Marked by annotator as a contribution to their answer

Example


import ir_datasets
dataset = ir_datasets.load('msmarco-qna/train')
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Example


import ir_datasets
dataset = ir_datasets.load('msmarco-qna/train')
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
