GitHub: datasets/msmarco_passage.py

ir_datasets: MSMARCO (passage)

Index
  1. msmarco-passage
  2. msmarco-passage/dev
  3. msmarco-passage/dev/2
  4. msmarco-passage/dev/judged
  5. msmarco-passage/dev/small
  6. msmarco-passage/eval
  7. msmarco-passage/eval/small
  8. msmarco-passage/train
  9. msmarco-passage/train/judged
  10. msmarco-passage/train/medical
  11. msmarco-passage/train/split200-train
  12. msmarco-passage/train/split200-valid
  13. msmarco-passage/train/triples-small
  14. msmarco-passage/train/triples-v2
  15. msmarco-passage/trec-dl-2019
  16. msmarco-passage/trec-dl-2019/judged
  17. msmarco-passage/trec-dl-2020
  18. msmarco-passage/trec-dl-2020/judged
  19. msmarco-passage/trec-dl-hard
  20. msmarco-passage/trec-dl-hard/fold1
  21. msmarco-passage/trec-dl-hard/fold2
  22. msmarco-passage/trec-dl-hard/fold3
  23. msmarco-passage/trec-dl-hard/fold4
  24. msmarco-passage/trec-dl-hard/fold5

"msmarco-passage"

A passage ranking benchmark with a collection of 8.8 million passages and question queries. Most relevance judgments are shallow (typically at most 1-2 per query), but the TREC Deep Learning track adds deep judgments. Evaluation is typically conducted using MRR@10.
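
For intuition, here is a minimal, self-contained sketch of MRR@10; the run and qrels mappings are hypothetical inputs, not part of the ir_datasets API:

def mrr_at_10(run, qrels):
    # run: query_id -> ranked list of doc_ids (best first)
    # qrels: query_id -> set of relevant doc_ids
    total = 0.0
    for query_id, ranking in run.items():
        relevant = qrels.get(query_id, set())
        for rank, doc_id in enumerate(ranking[:10], start=1):
            if doc_id in relevant:
                total += 1.0 / rank  # reciprocal rank of first relevant hit
                break
    return total / len(run)

mrr_at_10({"q1": ["d3", "d7"]}, {"q1": {"d7"}})  # 0.5: first hit at rank 2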

Note that the original document source files for this collection contain a double-encoding error that causes strange sequences like "å¬" and "ðºð". These are automatically corrected (properly converting the previous examples to "公" and "🇺🇸").
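
The underlying repair amounts to the classic "mojibake" reversal: the UTF-8 bytes were mistakenly decoded as CP-1252, so re-encoding and decoding as UTF-8 recovers the original characters. A minimal sketch (not the exact routine ir_datasets uses):

def fix_double_encoding(text):
    # 公 (UTF-8 bytes E5 85 AC) mis-decoded as CP-1252 yields "å…¬";
    # the middle character is often invisible, hence "å¬" above
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # already clean, or not recoverable this way

fix_double_encoding("å…¬")  # -> "公"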

Provides: docs
8.8M docs

Language: en

Document type:
GenericDoc: (namedtuple)
  1. doc_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.
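
docs_iter() streams the full collection; for random access by doc_id, the Python API also provides a docs_store() lookup:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage")
docstore = dataset.docs_store()
doc = docstore.get("0")  # an arbitrary valid doc_id, for illustration
doc.text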


"msmarco-passage/dev"

Official dev set.

scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the total available dev queries by the MS MARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).
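
A sketch of reading these candidate lists for re-ranking, grouping the top-1000 doc_ids by query:

import ir_datasets
from collections import defaultdict
dataset = ir_datasets.load("msmarco-passage/dev")
candidates = defaultdict(list)
for scoreddoc in dataset.scoreddocs_iter():
    # scoreddoc is a namedtuple<query_id, doc_id, score>; as noted
    # above, score is always 0 for this dataset
    candidates[scoreddoc.query_id].append(scoreddoc.doc_id)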

Official evaluation measures: RR@10

Provides: queries, docs, qrels
101K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage/dev/2"

"Dev2" split of the msmarco-passage/dev set. Originally released as part of the v2 corpus.

Official evaluation measures: RR@10

Provides: queries, docs, qrels
4.3K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage/dev/judged"

Subset of msmarco-passage/dev that only includes queries that have at least one qrel.

Official evaluation measures: RR@10

Provides: queries, docs, qrels
56K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage/dev/small"

Official "small" version of the dev set, consisting of 6,980 queries (6.9% of the full dev set).

Official evaluation measures: RR@10

Provides: queries, docs, qrels, scoreddocs
7.0K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/dev/small")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage/eval"

Official eval set for submission to MS MARCO leaderboard. Relevance judgments are hidden.

scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the total available eval queries by the MS MARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).

Official evaluation measures: RR@10

Provides: queries, docs
101K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage/eval/small"

Official "small" version of the eval set, consisting of 6,837 queries (6.8% of the full eval set).

Official evaluation measures: RR@10

Provides: queries, docs, scoreddocs
6.8K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/eval/small")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage/train"

Official train set.

Not all queries have relevance judgments. Use msmarco-passage/train/judged for a filtered list that only includes queries that have at least one qrel.

scoreddocs are the top 1000 results from BM25. These are used for the "re-ranking" setting. Note that these are sub-sampled to about 1/8 of the total available train queries by the MS MARCO authors for faster evaluation. The BM25 scores from scoreddocs are not available (all have a score of 0).

docpairs provides access to the "official" sequence for pairwise training.
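
A sketch of turning the docpairs into text triples for pairwise training, resolving IDs through the queries and the docs_store() lookup:

import itertools
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
docstore = dataset.docs_store()
queries = {q.query_id: q.text for q in dataset.queries_iter()}
for pair in itertools.islice(dataset.docpairs_iter(), 5):
    # pair is a namedtuple<query_id, doc_id_a, doc_id_b>; in the official
    # triples the first passage is the relevant (positive) one
    triple = (queries[pair.query_id],
              docstore.get(pair.doc_id_a).text,
              docstore.get(pair.doc_id_b).text)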

Official evaluation measures: RR@10

Provides: queries, docs, qrels, scoreddocs, docpairs
809K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage/train/judged"

Subset of msmarco-passage/train that only includes queries that have at least one qrel.

Official evaluation measures: RR@10

Provides: queries, docs, qrels, scoreddocs, docpairs
503K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage/train/medical"

Subset of msmarco-passage/train that only includes queries that have a layman or expert medical term. Note that this includes about 20% false matches due to terms with multiple senses.

Official evaluation measures: RR@10

Provides: queries, docs, qrels, scoreddocs, docpairs
79K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/medical")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage/train/split200-train"

Subset of msmarco-passage/train that excludes the 200 queries held out as a small validation set (see msmarco-passage/train/split200-valid). Used in various works.

Official evaluation measures: RR@10

Provides: queries, docs, qrels, scoreddocs, docpairs
809K queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage/train/split200-valid"

Subset of msmarco-passage/train containing only the 200 held-out queries intended as a small validation set. Used in various works.

Official evaluation measures: RR@10

Provides: queries, docs, qrels, scoreddocs, docpairs
200 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/split200-valid")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage/train/triples-small"

Version of msmarco-passage/train, but with the "small" triples file (a 10% sample of the full file).

Note that, to save storage space (27GB), the query and passage texts in the file are mapped to their corresponding IDs. This process takes a few minutes to run the first time the triples are requested.

Official evaluation measures: RR@10

Provides: queries, docs, qrels, scoreddocs, docpairs
809K queries

Inherits queries from msmarco-passage/train

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-small")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage/train/triples-v2"

Version of msmarco-passage/train, but with version 2 of the triples file.

This version of the triples file includes rows that were accidentally missing from version 1 of the file (see discussion here).

Note that this file is sorted by the IDs it contains, so you will likely want to shuffle it before use. We opened an issue suggesting that a shuffled third version of the file be provided, so that the order is consistent across groups using the data, but at this time, no such file exists in an official capacity.
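
One workaround is to shuffle the ID-based pairs yourself with a fixed seed, so the order is at least reproducible within your own group; a sketch, assuming the full list fits in memory (the file is large, so this may not suit all setups):

import random
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-v2")
pairs = list(dataset.docpairs_iter())  # ID-based pairs; still a large list
random.Random(42).shuffle(pairs)  # fixed seed for reproducibility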

Official evaluation measures: RR@10

Provides: queries, docs, qrels, scoreddocs, docpairs
809K queries

Inherits queries from msmarco-passage/train

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/train/triples-v2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage/trec-dl-2019"

Queries from the TREC Deep Learning (DL) 2019 shared task, which were sampled from msmarco-passage/eval. A subset of these queries was judged by NIST assessors (a filtered list is available in msmarco-passage/trec-dl-2019/judged).
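
Unlike the shallow, binary judgments elsewhere in this collection, these qrels are graded (0-3); the rel=2 cutoff in RR(rel=2) and AP(rel=2) treats grades of 2 and above as relevant:

import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019")
for qrel in dataset.qrels_iter():
    qrel  # namedtuple<query_id, doc_id, relevance, iteration>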

Official evaluation measures: nDCG@10, RR(rel=2), AP(rel=2)

Provides: queries, docs, qrels, scoreddocs
200 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage/trec-dl-2019/judged"

Subset of msmarco-passage/trec-dl-2019, only including queries with qrels.

Official evaluation measures: nDCG@10, RR(rel=2), AP(rel=2)

Provides: queries, docs, qrels, scoreddocs
43 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage/trec-dl-2020"

Queries from the TREC Deep Learning (DL) 2020 shared task, which were sampled from msmarco-passage/eval. A subset of these queries was judged by NIST assessors (a filtered list is available in msmarco-passage/trec-dl-2020/judged).

Official evaluation measures: nDCG@10, RR(rel=2), AP(rel=2)

Provides: queries, docs, qrels, scoreddocs
200 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage/trec-dl-2020/judged"

Subset of msmarco-passage/trec-dl-2020, only including queries with qrels.

Official evaluation measures: nDCG@10, RR(rel=2), AP(rel=2)

Provides: queries, docs, qrels, scoreddocs
54 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-2020/judged")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage/trec-dl-hard"

A more challenging subset of msmarco-passage/trec-dl-2019 and msmarco-passage/trec-dl-2020.
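
The queries are split into five folds (fold1 through fold5, listed below), which supports cross-validation; a minimal sketch of holding one fold out while pooling the rest:

import ir_datasets
folds = [ir_datasets.load(f"msmarco-passage/trec-dl-hard/fold{i}")
         for i in range(1, 6)]
heldout = folds[0]  # evaluate here
pooled = [q for ds in folds[1:] for q in ds.queries_iter()]  # e.g., for tuning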

Official evaluation measures: nDCG@10, RR(rel=2)

Provides: queries, docs, qrels
50 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage/trec-dl-hard/fold1"

Fold 1 of msmarco-passage/trec-dl-hard

Official evaluation measures: nDCG@10, RR(rel=2)

Provides: queries, docs, qrels
10 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold1")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage/trec-dl-hard/fold2"

Fold 2 of msmarco-passage/trec-dl-hard

Official evaluation measures: nDCG@10, RR(rel=2)

Provides: queries, docs, qrels
10 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold2")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage/trec-dl-hard/fold3"

Fold 3 of msmarco-passage/trec-dl-hard

Official evaluation measures: nDCG@10, RR(rel=2)

Provides: queries, docs, qrels
10 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage/trec-dl-hard/fold4"

Fold 4 of msmarco-passage/trec-dl-hard

Official evaluation measures: nDCG@10, RR(rel=2)

Provides: queries, docs, qrels
10 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold4")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.


"msmarco-passage/trec-dl-hard/fold5"

Fold 5 of msmarco-passage/trec-dl-hard

Official evaluation measures: nDCG@10, RR(rel=2)

Provides: queries, docs, qrels
10 queries

Language: en

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API:
import ir_datasets
dataset = ir_datasets.load("msmarco-passage/trec-dl-hard/fold5")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.