ir_datasets: MSMARCO (QnA)
The MS MARCO Question Answering dataset. This is the source collection of msmarco-passage and msmarco-document.
Query IDs in this collection align with those found in msmarco-passage and msmarco-document. The collection does not provide doc_ids, so these are assigned in the following format: [msmarco_passage_id]-[url_seq], where [msmarco_passage_id] is the document from msmarco-passage that has matching contents and [url_seq] is assigned sequentially for each URL encountered. In other words, all documents with the same prefix have the same text; they only differ in the originating document.
Doc msmarco_passage_id fields are assigned by matching passage contents in msmarco-passage, and this field is provided for every document. Doc msmarco_document_id fields are assigned by matching the URL to the one found in msmarco-document. Due to how msmarco-document was constructed, there is not necessarily a match (the value will be None if no match is found).
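As a minimal sketch of these id conventions (assuming msmarco-passage ids contain no hyphen, which holds for their numeric ids), a doc_id can be split back into its two components:
import ir_datasets
dataset = ir_datasets.load("msmarco-qna")
for doc in dataset.docs_iter():
    # doc_id is "[msmarco_passage_id]-[url_seq]"; split from the right,
    # since the passage id itself is hyphen-free
    passage_id, url_seq = doc.doc_id.rsplit('-', 1)
    assert passage_id == doc.msmarco_passage_id
    if doc.msmarco_document_id is None:
        pass  # this URL had no match in msmarco-document
    break  # inspect just the first document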
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-qna")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, url, msmarco_passage_id, msmarco_document_id>
You can find more details about the Python API here.
ir_datasets export msmarco-qna docs
[doc_id] [text] [url] [msmarco_passage_id] [msmarco_document_id]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-qna')
# Index msmarco-qna
indexer = pt.IterDictIndexer('./indices/msmarco-qna')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'url', 'msmarco_passage_id', 'msmarco_document_id'])
You can find more details about PyTerrier indexing here.
msmarco-qna/dev
Official dev set.
The scoreddocs provide the roughly 10 passages presented to the user for annotation, where the score indicates the order in which they were presented.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-qna/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, type, answers>
You can find more details about the Python API here.
ir_datasets export msmarco-qna/dev queries
[query_id] [text] [type] [answers]
...
You can find more details about the CLI here.
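Since dev queries carry reference answers, a quick way to inspect them is shown below; this is a minimal sketch, assuming answers is a sequence of free-text strings (worth checking against the namedtuple on your version):
import ir_datasets
dataset = ir_datasets.load("msmarco-qna/dev")
for query in dataset.queries_iter():
    print(query.query_id, query.type, query.text)
    for answer in query.answers:  # free-text reference answers
        print('  answer:', answer)
    break  # inspect just the first query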
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-qna/dev')
index_ref = pt.IndexRef.of('./indices/msmarco-qna') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-qna
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-qna/dev")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, url, msmarco_passage_id, msmarco_document_id>
You can find more details about the Python API here.
ir_datasets export msmarco-qna/dev docs
[doc_id] [text] [url] [msmarco_passage_id] [msmarco_document_id]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-qna/dev')
# Index msmarco-qna
indexer = pt.IterDictIndexer('./indices/msmarco-qna')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'url', 'msmarco_passage_id', 'msmarco_document_id'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition
---|---
0 | Not marked by the annotator as a contribution to their answer
1 | Marked by the annotator as a contribution to their answer
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-qna/dev")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-qna/dev qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
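Because relevance here is binary, a common use of the qrels is collecting the answer-bearing passages per query; a minimal sketch:
import ir_datasets
from collections import defaultdict
dataset = ir_datasets.load("msmarco-qna/dev")
relevant = defaultdict(set)
for qrel in dataset.qrels_iter():
    if qrel.relevance >= 1:  # marked as contributing to the answer
        relevant[qrel.query_id].add(qrel.doc_id)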
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-qna/dev')
index_ref = pt.IndexRef.of('./indices/msmarco-qna') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-qna/dev")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-qna/dev scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
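To recover the list shown to each annotator, the scoreddocs can be grouped by query and ordered by score; a minimal sketch, assuming the usual run convention that a higher score means an earlier position (worth verifying on your version):
import ir_datasets
from collections import defaultdict
dataset = ir_datasets.load("msmarco-qna/dev")
presented = defaultdict(list)
for sd in dataset.scoreddocs_iter():
    presented[sd.query_id].append((sd.score, sd.doc_id))
for query_id, docs in presented.items():
    # sort descending by score to approximate the presented order
    docs.sort(key=lambda pair: pair[0], reverse=True)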
No example available for PyTerrier
msmarco-qna/eval
Official eval set.
The scoreddocs provide the roughly 10 passages presented to the user for annotation, where the score indicates the order in which they were presented.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-qna/eval")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, type>
You can find more details about the Python API here.
ir_datasets export msmarco-qna/eval queries
[query_id] [text] [type]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-qna/eval')
index_ref = pt.IndexRef.of('./indices/msmarco-qna') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-qna
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-qna/eval")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, url, msmarco_passage_id, msmarco_document_id>
You can find more details about the Python API here.
ir_datasets export msmarco-qna/eval docs
[doc_id] [text] [url] [msmarco_passage_id] [msmarco_document_id]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-qna/eval')
# Index msmarco-qna
indexer = pt.IterDictIndexer('./indices/msmarco-qna')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'url', 'msmarco_passage_id', 'msmarco_document_id'])
You can find more details about PyTerrier indexing here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-qna/eval")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-qna/eval scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
msmarco-qna/train
Official train set.
The scoreddocs provide the roughly 10 passages presented to the user for annotation, where the score indicates the order in which they were presented.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-qna/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text, type, answers>
You can find more details about the Python API here.
ir_datasets export msmarco-qna/train queries
[query_id] [text] [type] [answers]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-qna/train')
index_ref = pt.IndexRef.of('./indices/msmarco-qna') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('text'))
You can find more details about PyTerrier retrieval here.
Language: en
Note: Uses docs from msmarco-qna
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-qna/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text, url, msmarco_passage_id, msmarco_document_id>
You can find more details about the Python API here.
ir_datasets export msmarco-qna/train docs
[doc_id] [text] [url] [msmarco_passage_id] [msmarco_document_id]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:msmarco-qna/train')
# Index msmarco-qna
indexer = pt.IterDictIndexer('./indices/msmarco-qna')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'url', 'msmarco_passage_id', 'msmarco_document_id'])
You can find more details about PyTerrier indexing here.
Relevance levels
Rel. | Definition
---|---
0 | Not marked by the annotator as a contribution to their answer
1 | Marked by the annotator as a contribution to their answer
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-qna/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export msmarco-qna/train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
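Because the train qrels mark which passages contributed to each answer, they can also drive training-pair generation; a minimal sketch (note it materializes all train query texts in memory):
import ir_datasets
dataset = ir_datasets.load("msmarco-qna/train")
docs = dataset.docs_store()  # random access to passages by doc_id
queries = {q.query_id: q.text for q in dataset.queries_iter()}
def training_pairs():
    # yields (query text, positive passage text) pairs
    for qrel in dataset.qrels_iter():
        if qrel.relevance >= 1:
            yield queries[qrel.query_id], docs.get(qrel.doc_id).text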
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:msmarco-qna/train')
index_ref = pt.IndexRef.of('./indices/msmarco-qna') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('text'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
Examples:
import ir_datasets
dataset = ir_datasets.load("msmarco-qna/train")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export msmarco-qna/train scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier