ir_datasets : WikIR

import ir_datasets
dataset = ir_datasets.load("wikir/en1k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en1k docs



[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k')
# Index wikir/en1k
indexer = pt.IterDictIndexer('./indices/wikir_en1k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

\cite{Frej2020Wikir,Frej2020MlWikir}

Bibtex:

@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}

@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}

{
  "docs": {
    "count": 369721,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  }
}

`"wikir/en1k/test"`

Test set of wikir/en1k. Scoreddocs are the provided BM25 run.

queries

100 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en1k/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en1k/test queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/test')
index_ref = pt.IndexRef.of('./indices/wikir_en1k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

370K docs

Inherits docs from wikir/en1k

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en1k/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en1k/test docs



[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/test')
# Index wikir/en1k
indexer = pt.IterDictIndexer('./indices/wikir_en1k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

qrels

4.4K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Otherwise	`0`	0.0%
1	There is a link to the article with the query as its title in the first sentence	`4.3K`	97.7%
2	Query is the article title	`100`	2.3%
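The counts and percentages in a table like the one above can be recomputed from the qrels themselves. The sketch below tallies relevance levels; `Qrel` is a stand-in namedtuple with the same fields as `TrecQrel`, and the sample rows are invented for illustration.

```python
from collections import Counter, namedtuple

# Stand-in with the same fields as TrecQrel; the rows below are invented.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def relevance_distribution(qrels):
    """Map each relevance level to (count, percentage of all qrels)."""
    counts = Counter(qrel.relevance for qrel in qrels)
    total = sum(counts.values())
    return {
        level: (n, round(100 * n / total, 1))
        for level, n in sorted(counts.items())
    }


sample = [
    Qrel("q1", "d1", 2, "0"),
    Qrel("q1", "d2", 1, "0"),
    Qrel("q1", "d3", 1, "0"),
    Qrel("q2", "d4", 2, "0"),
]
distribution = relevance_distribution(sample)
print(distribution)  # {1: (2, 50.0), 2: (2, 50.0)}
```

With the real data, the same function should accept `ir_datasets.load("wikir/en1k/test").qrels_iter()` in place of `sample`.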

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en1k/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en1k/test qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.
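If the exported TSV is saved to a file, it can be read back with the standard library. This is a minimal sketch that assumes the four-column order shown above; `QrelRow`, `read_qrels_tsv`, and the sample rows are hypothetical names and data, not part of ir_datasets.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple


class QrelRow(NamedTuple):
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def read_qrels_tsv(lines: Iterable[str]) -> Iterator[QrelRow]:
    """Parse tab-separated qrels lines into typed rows."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for query_id, doc_id, relevance, iteration in reader:
        yield QrelRow(query_id, doc_id, int(relevance), iteration)


# Invented sample rows in the column order shown above.
sample_export = io.StringIO("101\t101\t2\t0\n101\t2045\t1\t0\n")
rows = list(read_qrels_tsv(sample_export))
print(rows[0])
```

In practice the `io.StringIO` object would be replaced by an open file handle for wherever the CLI output was redirected.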

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/test')
index_ref = pt.IndexRef.of('./indices/wikir_en1k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.
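To see how a graded measure such as nDCG@20 uses the two relevance levels, here is a small pure-Python sketch. It assumes linear gains (the relevance value itself) and a log2 rank discount; evaluation libraries may use a different formulation, so treat it as an illustration rather than a reference implementation. The document IDs are invented.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_doc_ids, qrels_for_query, k=20):
    """nDCG@k for one query; unjudged documents count as relevance 0."""
    gains = [qrels_for_query.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(qrels_for_query.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Invented judgments: one title match (2) and two first-sentence links (1).
qrels_for_query = {"d1": 2, "d2": 1, "d3": 1}
perfect = ndcg_at_k(["d1", "d2", "d3", "d9"], qrels_for_query)
swapped = ndcg_at_k(["d2", "d1", "d3", "d9"], qrels_for_query)
print(perfect, swapped)
```

Ranking the level-2 document below a level-1 document lowers the score, which is the behaviour a binary measure would not capture.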

scoreddocs

10K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en1k/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.
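Since scoreddocs arrive as a flat stream of (query_id, doc_id, score) records, a common first step is to group them per query and keep the best-scoring candidates. The sketch below does this with a stand-in `ScoredDoc` namedtuple and invented rows; `top_k_per_query` is a hypothetical helper, not an ir_datasets function.

```python
import heapq
from collections import defaultdict, namedtuple

# Stand-in with the same fields as GenericScoredDoc; rows are invented.
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])


def top_k_per_query(scoreddocs, k):
    """Group scored docs by query and keep the k highest-scoring per query."""
    by_query = defaultdict(list)
    for sd in scoreddocs:
        by_query[sd.query_id].append(sd)
    return {
        query_id: heapq.nlargest(k, docs, key=lambda sd: sd.score)
        for query_id, docs in by_query.items()
    }


sample = [
    ScoredDoc("q1", "d1", 12.0),
    ScoredDoc("q1", "d2", 15.5),
    ScoredDoc("q1", "d3", 9.0),
    ScoredDoc("q2", "d4", 7.25),
]
top = top_k_per_query(sample, k=2)
print([sd.doc_id for sd in top["q1"]])  # ['d2', 'd1']
```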

CLI

ir_datasets export wikir/en1k/test scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Frej2020Wikir,Frej2020MlWikir}


{
  "docs": {
    "count": 369721,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 100
  },
  "qrels": {
    "count": 4435,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 100,
          "1": 4335
        }
      }
    }
  },
  "scoreddocs": {
    "count": 10000
  }
}
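The metadata above can be checked for internal consistency with a few lines of Python. This sketch copies the JSON for wikir/en1k/test in compact form and derives per-query averages; `summarize` is a hypothetical helper name.

```python
import json

# Metadata for wikir/en1k/test, copied from the block above in compact form.
METADATA_JSON = """
{
  "docs": {"count": 369721,
           "fields": {"doc_id": {"max_len": 7, "common_prefix": ""}}},
  "queries": {"count": 100},
  "qrels": {"count": 4435,
            "fields": {"relevance": {"counts_by_value": {"2": 100, "1": 4335}}}},
  "scoreddocs": {"count": 10000}
}
"""


def summarize(metadata):
    """Derive simple per-query averages from a metadata dict."""
    n_queries = metadata["queries"]["count"]
    by_value = metadata["qrels"]["fields"]["relevance"]["counts_by_value"]
    return {
        # The per-level counts should add up to the total number of qrels.
        "qrels_consistent": sum(by_value.values()) == metadata["qrels"]["count"],
        "qrels_per_query": metadata["qrels"]["count"] / n_queries,
        "scoreddocs_per_query": metadata["scoreddocs"]["count"] / n_queries,
    }


summary = summarize(json.loads(METADATA_JSON))
print(summary)
```

For this split the scoreddocs average works out to 100 per query, and the number of level-2 qrels equals the number of queries.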

`"wikir/en1k/training"`

Training set of wikir/en1k. Scoreddocs are the provided BM25 run.

queries

1.4K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en1k/training")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en1k/training queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/training')
index_ref = pt.IndexRef.of('./indices/wikir_en1k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

370K docs

Inherits docs from wikir/en1k

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en1k/training")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en1k/training docs



[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/training')
# Index wikir/en1k
indexer = pt.IterDictIndexer('./indices/wikir_en1k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

qrels

48K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Otherwise	`0`	0.0%
1	There is a link to the article with the query as its title in the first sentence	`46K`	97.0%
2	Query is the article title	`1.4K`	3.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en1k/training")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en1k/training qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/training')
index_ref = pt.IndexRef.of('./indices/wikir_en1k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

scoreddocs

144K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en1k/training")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.
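One possible use of the training qrels together with the BM25 scoreddocs is to build (query, positive, negative) triples for training a re-ranker. This is an assumption about usage, not something the dataset prescribes. The sketch below treats any judged document with relevance above 0 as positive and any BM25 candidate without such a judgment as negative; the namedtuples are stand-ins and the rows are invented.

```python
import random
from collections import defaultdict, namedtuple

# Stand-ins mirroring the TrecQrel and GenericScoredDoc fields listed above.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])


def build_triples(qrels, scoreddocs, seed=0):
    """Return (query_id, positive_doc_id, negative_doc_id) triples."""
    rng = random.Random(seed)
    positives = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance > 0:
            positives[qrel.query_id].add(qrel.doc_id)
    candidates = defaultdict(list)
    for sd in scoreddocs:
        candidates[sd.query_id].append(sd.doc_id)

    triples = []
    for query_id, pos_ids in positives.items():
        # Negatives: BM25 candidates that have no positive judgment.
        negatives = [d for d in candidates.get(query_id, []) if d not in pos_ids]
        if not negatives:
            continue
        for pos_id in sorted(pos_ids):
            triples.append((query_id, pos_id, rng.choice(negatives)))
    return triples


# Invented rows for illustration only.
qrels = [Qrel("q1", "d1", 2, "0"), Qrel("q1", "d2", 1, "0")]
scoreddocs = [
    ScoredDoc("q1", "d1", 12.0),
    ScoredDoc("q1", "d7", 9.5),
    ScoredDoc("q1", "d8", 9.1),
]
triples = build_triples(qrels, scoreddocs)
print(triples)
```

Sampling one negative per positive is an arbitrary choice made here for brevity; the fixed seed only makes the sketch reproducible.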

CLI

ir_datasets export wikir/en1k/training scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Frej2020Wikir,Frej2020MlWikir}


{
  "docs": {
    "count": 369721,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1444
  },
  "qrels": {
    "count": 47699,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 1444,
          "1": 46255
        }
      }
    }
  },
  "scoreddocs": {
    "count": 144400
  }
}

`"wikir/en1k/validation"`

Validation set of wikir/en1k. Scoreddocs are the provided BM25 run.

queries

100 queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en1k/validation")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en1k/validation queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_en1k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

370K docs

Inherits docs from wikir/en1k

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en1k/validation")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en1k/validation docs



[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/validation')
# Index wikir/en1k
indexer = pt.IterDictIndexer('./indices/wikir_en1k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

qrels

5.0K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Otherwise	`0`	0.0%
1	There is a link to the article with the query as its title in the first sentence	`4.9K`	98.0%
2	Query is the article title	`100`	2.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en1k/validation")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en1k/validation qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_en1k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.
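For readers who want to see what MAP measures, here is a small pure-Python sketch of average precision and its mean over queries. It treats any relevance above 0 as relevant and assumes each ranking contains no duplicate document IDs; the function names and data are illustrative, not taken from PyTerrier.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average of precision values at the ranks where relevant docs appear."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(run, relevant_by_query):
    """Mean of per-query average precision over all judged queries."""
    scores = [
        average_precision(run.get(query_id, []), relevant)
        for query_id, relevant in relevant_by_query.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Invented run and judgments.
run = {"q1": ["d1", "d9", "d2"], "q2": ["d9"]}
relevant_by_query = {"q1": {"d1", "d2"}, "q2": {"d3"}}
ap_q1 = average_precision(run["q1"], relevant_by_query["q1"])
map_score = mean_average_precision(run, relevant_by_query)
print(ap_q1, map_score)
```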

scoreddocs

10K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en1k/validation")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.
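Because the scoreddocs are a fixed BM25 run, any re-ranker built on them is capped by how many relevant documents that run contains. The following sketch computes, per query, the fraction of judged-relevant documents that appear among the candidates; `candidate_recall` and the stand-in namedtuples are hypothetical, and the rows are invented.

```python
from collections import defaultdict, namedtuple

# Stand-ins mirroring the TrecQrel and GenericScoredDoc fields listed above.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])


def candidate_recall(qrels, scoreddocs):
    """Per-query share of relevant docs that appear in the candidate run."""
    relevant = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance > 0:
            relevant[qrel.query_id].add(qrel.doc_id)
    retrieved = defaultdict(set)
    for sd in scoreddocs:
        retrieved[sd.query_id].add(sd.doc_id)
    return {
        query_id: len(rel & retrieved.get(query_id, set())) / len(rel)
        for query_id, rel in relevant.items()
    }


# Invented rows for illustration only.
qrels = [
    Qrel("q1", "d1", 2, "0"),
    Qrel("q1", "d2", 1, "0"),
    Qrel("q2", "d3", 2, "0"),
]
scoreddocs = [ScoredDoc("q1", "d1", 11.0), ScoredDoc("q1", "d7", 8.0)]
recall = candidate_recall(qrels, scoreddocs)
print(recall)  # {'q1': 0.5, 'q2': 0.0}
```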

CLI

ir_datasets export wikir/en1k/validation scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Frej2020Wikir,Frej2020MlWikir}


{
  "docs": {
    "count": 369721,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 100
  },
  "qrels": {
    "count": 4979,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 100,
          "1": 4879
        }
      }
    }
  },
  "scoreddocs": {
    "count": 10000
  }
}

`"wikir/en59k"`

WikIR for English.

docs

2.5M docs

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en59k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.
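With about 2.5M documents, it is often useful to peek at the first few records without consuming the whole iterator. The sketch below uses `itertools.islice` for that; `preview` and `fake_docs` are hypothetical names, and `fake_docs` is an endless stand-in generator used only so the example runs without downloading the corpus.

```python
from collections import namedtuple
from itertools import islice

# Stand-in with the same fields as GenericDoc.
Doc = namedtuple("Doc", ["doc_id", "text"])


def preview(docs_iter, n=3, width=40):
    """Return (doc_id, truncated text) for the first n documents only."""
    return [(doc.doc_id, doc.text[:width]) for doc in islice(docs_iter, n)]


def fake_docs():
    """Endless generator of invented documents, standing in for docs_iter()."""
    i = 0
    while True:
        yield Doc(str(i), f"text of document {i} " * 5)
        i += 1


first_two = preview(fake_docs(), n=2, width=18)
print(first_two)
```

Because `islice` stops after `n` items, the same call should work on `dataset.docs_iter()` without reading the remaining documents.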

CLI

ir_datasets export wikir/en59k docs



[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k')
# Index wikir/en59k
indexer = pt.IterDictIndexer('./indices/wikir_en59k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

\cite{Frej2020Wikir,Frej2020MlWikir}


{
  "docs": {
    "count": 2454785,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  }
}

`"wikir/en59k/test"`

Test set of wikir/en59k. Scoreddocs are the provided BM25 run.

queries

1.0K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en59k/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en59k/test queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/test')
index_ref = pt.IndexRef.of('./indices/wikir_en59k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

2.5M docs

Inherits docs from wikir/en59k

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en59k/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en59k/test docs



[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/test')
# Index wikir/en59k
indexer = pt.IterDictIndexer('./indices/wikir_en59k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

qrels

105K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Otherwise	`0`	0.0%
1	There is a link to the article with the query as its title in the first sentence	`104K`	99.0%
2	Query is the article title	`1.0K`	1.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en59k/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en59k/test qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/test')
index_ref = pt.IndexRef.of('./indices/wikir_en59k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

scoreddocs

100K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en59k/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en59k/test scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Frej2020Wikir,Frej2020MlWikir}


{
  "docs": {
    "count": 2454785,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1000
  },
  "qrels": {
    "count": 104715,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 1000,
          "1": 103715
        }
      }
    }
  },
  "scoreddocs": {
    "count": 100000
  }
}

`"wikir/en59k/training"`

Training set of wikir/en59k. Scoreddocs are the provided BM25 run.

queries

57K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en59k/training")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en59k/training queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/training')
index_ref = pt.IndexRef.of('./indices/wikir_en59k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

2.5M docs

Inherits docs from wikir/en59k

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en59k/training")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en59k/training docs



[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/training')
# Index wikir/en59k
indexer = pt.IterDictIndexer('./indices/wikir_en59k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

qrels

2.4M qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Otherwise	`0`	0.0%
1	There is a link to the article with the query as its title in the first sentence	`2.4M`	97.7%
2	Query is the article title	`57K`	2.3%

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en59k/training")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en59k/training qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/training')
index_ref = pt.IndexRef.of('./indices/wikir_en59k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

scoreddocs

5.7M scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en59k/training")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en59k/training scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Frej2020Wikir,Frej2020MlWikir}


{
  "docs": {
    "count": 2454785,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 57251
  },
  "qrels": {
    "count": 2443383,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 57251,
          "1": 2386132
        }
      }
    }
  },
  "scoreddocs": {
    "count": 5725100
  }
}

`"wikir/en59k/validation"`

Validation set of wikir/en59k. Scoreddocs are the provided BM25 run.

queries

1.0K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en59k/validation")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en59k/validation queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_en59k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

2.5M docs

Inherits docs from wikir/en59k

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en59k/validation")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en59k/validation docs



[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/validation')
# Index wikir/en59k
indexer = pt.IterDictIndexer('./indices/wikir_en59k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

qrels

69K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Otherwise	`0`	0.0%
1	There is a link to the article with the query as its title in the first sentence	`68K`	98.5%
2	Query is the article title	`1.0K`	1.5%

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en59k/validation")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en59k/validation qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_en59k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

scoreddocs

100K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en59k/validation")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en59k/validation scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Frej2020Wikir,Frej2020MlWikir}


{
  "docs": {
    "count": 2454785,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1000
  },
  "qrels": {
    "count": 68905,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 1000,
          "1": 67905
        }
      }
    }
  },
  "scoreddocs": {
    "count": 100000
  }
}

`"wikir/en78k"`

WikIR for English. This is one of the two versions used in Frej2020Wikir.

docs

2.5M docs

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en78k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en78k docs



[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k')
# Index wikir/en78k
indexer = pt.IterDictIndexer('./indices/wikir_en78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

\cite{Frej2020Wikir,Frej2020MlWikir}


{
  "docs": {
    "count": 2456637,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  }
}

`"wikir/en78k/test"`

Test set of wikir/en78k. Scoreddocs are the provided BM25 run.

queries

7.9K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en78k/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en78k/test queries



[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/test')
index_ref = pt.IndexRef.of('./indices/wikir_en78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

2.5M docs

Inherits docs from wikir/en78k

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en78k/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en78k/test docs



[doc_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/test')
# Index wikir/en78k
indexer = pt.IterDictIndexer('./indices/wikir_en78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

qrels

353K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Otherwise	`0`	0.0%
1	There is a link to the article with the query as its title in the first sentence	`345K`	97.8%
2	Query is the article title	`7.9K`	2.2%
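The counts in the table above can be recomputed by tallying the `relevance` field of each qrel. The following is a minimal sketch: `relevance_histogram` is a hypothetical helper name, and the `Qrel` tuples are stand-ins for what `dataset.qrels_iter()` yields, so the snippet runs without downloading the collection.

```python
from collections import Counter, namedtuple

# Stand-in with the same fields as the TrecQrel namedtuple described above.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def relevance_histogram(qrels):
    """Return {relevance: (count, percentage)} for an iterable of qrels."""
    counts = Counter(q.relevance for q in qrels)
    total = sum(counts.values())
    return {rel: (n, 100.0 * n / total) for rel, n in sorted(counts.items())}


# Illustrative data only; with ir_datasets installed, pass dataset.qrels_iter().
sample = [
    Qrel("q1", "d1", 2, "0"),
    Qrel("q1", "d2", 1, "0"),
    Qrel("q1", "d3", 1, "0"),
    Qrel("q2", "d4", 2, "0"),
]
hist = relevance_histogram(sample)
print(hist)
```

Because the helper only reads the `relevance` attribute, it should work unchanged on the real iterator.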

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en78k/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en78k/test qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/test')
index_ref = pt.IndexRef.of('./indices/wikir_en78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

786K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en78k/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en78k/test scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier
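The scoreddocs arrive as flat `(query_id, doc_id, score)` records, so a per-query ranking has to be assembled by the caller. The sketch below shows one way to do that in plain Python. `rankings_from_scoreddocs` and its `k` parameter are hypothetical names introduced for illustration, and the `ScoredDoc` tuples are made-up stand-ins for `dataset.scoreddocs_iter()`.

```python
from collections import defaultdict, namedtuple

# Stand-in with the same fields as GenericScoredDoc.
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])


def rankings_from_scoreddocs(scoreddocs, k=None):
    """Group scored docs by query and return doc_ids sorted by descending score.

    k limits each ranking to its top-k entries; None keeps everything.
    """
    by_query = defaultdict(list)
    for sd in scoreddocs:
        by_query[sd.query_id].append((sd.doc_id, sd.score))
    return {
        qid: [doc_id for doc_id, _ in
              sorted(pairs, key=lambda p: p[1], reverse=True)[:k]]
        for qid, pairs in by_query.items()
    }


# Illustrative records only.
sample = [
    ScoredDoc("q1", "d7", 11.2),
    ScoredDoc("q1", "d3", 14.8),
    ScoredDoc("q2", "d9", 5.0),
]
rankings = rankings_from_scoreddocs(sample, k=10)
print(rankings)
```

Sorting explicitly avoids relying on the order in which the records are stored, which the documentation above does not specify.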

\cite{Frej2020Wikir,Frej2020MlWikir}

Bibtex:

@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}

{
  "docs": {
    "count": 2456637,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 7862
  },
  "qrels": {
    "count": 353060,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 7862,
          "1": 345198
        }
      }
    }
  },
  "scoreddocs": {
    "count": 785600
  }
}
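The rounded figures shown for this split (7.9K queries, 353K qrels, 97.8% / 2.2%) follow directly from the metadata block above. As a small worked check, the sketch below recomputes them and also derives two per-query averages; the variable names are illustrative, and the JSON literal is an abridged copy of the block above.

```python
import json

# Abridged copy of the wikir/en78k/test metadata shown above.
metadata = json.loads("""
{
  "queries": {"count": 7862},
  "qrels": {
    "count": 353060,
    "fields": {"relevance": {"counts_by_value": {"2": 7862, "1": 345198}}}
  },
  "scoreddocs": {"count": 785600}
}
""")

n_queries = metadata["queries"]["count"]
n_qrels = metadata["qrels"]["count"]
by_value = metadata["qrels"]["fields"]["relevance"]["counts_by_value"]

# The per-level counts should add up to the total number of qrels.
assert sum(by_value.values()) == n_qrels

# Percentage of qrels at each relevance level, rounded as in the table.
share = {rel: round(100.0 * n / n_qrels, 1) for rel, n in by_value.items()}

# Average number of judged and BM25-scored documents per query.
qrels_per_query = n_qrels / n_queries
scoreddocs_per_query = metadata["scoreddocs"]["count"] / n_queries

print(share, round(qrels_per_query, 1), round(scoreddocs_per_query, 1))
```

The averages come out at roughly 45 judged documents and just under 100 scored documents per query.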

`"wikir/en78k/training"`

Training set of wikir/en78k. Scoreddocs are the provided BM25 run.

63K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en78k/training")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en78k/training queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/training')
index_ref = pt.IndexRef.of('./indices/wikir_en78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

2.5M docs

Inherits docs from wikir/en78k

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en78k/training")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en78k/training docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/training')
# Index wikir/en78k
indexer = pt.IterDictIndexer('./indices/wikir_en78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

qrels

2.4M qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Otherwise	`0`	0.0%
1	There is a link to the article with the query as its title in the first sentence	`2.4M`	97.4%
2	Query is the article title	`63K`	2.6%

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en78k/training")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en78k/training qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/training')
index_ref = pt.IndexRef.of('./indices/wikir_en78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

6.3M scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en78k/training")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en78k/training scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier
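A training split that ships both qrels and a BM25 run is often turned into (query, positive, negative) triples for neural rankers. The sketch below shows one common heuristic, which is an assumption of this example and not something the dataset prescribes: positives come from the qrels, and negatives are sampled from BM25-scored documents that have no judgment. `build_triples` and `seed` are hypothetical names, and the records are made up; in practice the two arguments would be `dataset.qrels_iter()` and `dataset.scoreddocs_iter()`.

```python
import random
from collections import defaultdict


def build_triples(qrels, scoreddocs, seed=0):
    """Return (query_id, positive_doc_id, negative_doc_id) triples.

    Assumption: an unjudged BM25 candidate is treated as a negative.
    Such a document may in fact be relevant, so this is only a heuristic.
    """
    positives = defaultdict(set)
    for query_id, doc_id, relevance, _iteration in qrels:
        if relevance > 0:
            positives[query_id].add(doc_id)

    candidates = defaultdict(list)
    for query_id, doc_id, _score in scoreddocs:
        if doc_id not in positives.get(query_id, ()):
            candidates[query_id].append(doc_id)

    rng = random.Random(seed)  # fixed seed keeps the sampling reproducible
    triples = []
    for query_id in sorted(positives):
        negatives = candidates.get(query_id)
        if not negatives:
            continue  # no usable negative for this query
        for pos in sorted(positives[query_id]):
            triples.append((query_id, pos, rng.choice(negatives)))
    return triples


# Illustrative records only (plain tuples unpack like the namedtuples).
qrels = [("q1", "d1", 2, "0"), ("q1", "d2", 1, "0")]
scoreddocs = [("q1", "d1", 9.0), ("q1", "d3", 5.0), ("q2", "d9", 1.0)]
triples = build_triples(qrels, scoreddocs)
print(triples)
```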

\cite{Frej2020Wikir,Frej2020MlWikir}

Bibtex:

@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}

{
  "docs": {
    "count": 2456637,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 62904
  },
  "qrels": {
    "count": 2435257,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 62904,
          "1": 2372353
        }
      }
    }
  },
  "scoreddocs": {
    "count": 6284800
  }
}

`"wikir/en78k/validation"`

Validation set of wikir/en78k. Scoreddocs are the provided BM25 run.

7.9K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en78k/validation")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en78k/validation queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_en78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

2.5M docs

Inherits docs from wikir/en78k

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en78k/validation")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en78k/validation docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/validation')
# Index wikir/en78k
indexer = pt.IterDictIndexer('./indices/wikir_en78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

qrels

272K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Otherwise	`0`	0.0%
1	There is a link to the article with the query as its title in the first sentence	`264K`	97.1%
2	Query is the article title	`7.9K`	2.9%

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en78k/validation")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en78k/validation qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_en78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

786K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/en78k/validation")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/en78k/validation scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier
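Since the split provides both graded qrels and a BM25 run, the run can be scored without building an index. The sketch below computes a mean nDCG@k in plain Python, assuming linear gains (the relevance level itself) and a log2 rank discount; other nDCG variants exist, so the numbers need not match a given evaluation tool exactly. `ndcg_at_k` is a hypothetical helper and the records are made up; with ir_datasets installed, the arguments would be `dataset.qrels_iter()` and `dataset.scoreddocs_iter()`.

```python
import math
from collections import defaultdict


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(qrels, scoreddocs, k=20):
    """Mean nDCG@k over queries that have at least one relevant document."""
    judged = defaultdict(dict)
    for query_id, doc_id, relevance, _iteration in qrels:
        judged[query_id][doc_id] = relevance

    run = defaultdict(list)
    for query_id, doc_id, score in scoreddocs:
        run[query_id].append((score, doc_id))

    per_query = []
    for query_id, rels in judged.items():
        ranked = sorted(run.get(query_id, []), key=lambda p: p[0], reverse=True)[:k]
        gains = [rels.get(doc_id, 0) for _score, doc_id in ranked]
        ideal = dcg(sorted(rels.values(), reverse=True)[:k])
        if ideal > 0:
            per_query.append(dcg(gains) / ideal)
    return sum(per_query) / len(per_query) if per_query else 0.0


# Illustrative records only.
qrels = [("q1", "d1", 2, "0"), ("q1", "d2", 1, "0")]
perfect_run = [("q1", "d1", 3.0), ("q1", "d2", 2.0)]
swapped_run = [("q1", "d1", 2.0), ("q1", "d2", 3.0)]
print(ndcg_at_k(qrels, perfect_run), ndcg_at_k(qrels, swapped_run))
```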

\cite{Frej2020Wikir,Frej2020MlWikir}

Bibtex:

@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}

{
  "docs": {
    "count": 2456637,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 7862
  },
  "qrels": {
    "count": 271874,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 7862,
          "1": 264012
        }
      }
    }
  },
  "scoreddocs": {
    "count": 785700
  }
}

`"wikir/ens78k"`

WikIR for English, using the first sentences of articles as queries. This is one of the two versions used in Frej2020Wikir.

docs

2.5M docs

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/ens78k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/ens78k docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k')
# Index wikir/ens78k
indexer = pt.IterDictIndexer('./indices/wikir_ens78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.
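Iterating over 2.5M documents is wasteful when only a few texts are needed, for instance to inspect the documents named in a qrel. ir_datasets offers random access through `dataset.docs_store()`, whose `get(doc_id)` call returns the matching document. The sketch below uses a small in-memory stand-in with the same `get` call so that it runs without the collection; `InMemoryDocStore`, `snippet`, and `width` are hypothetical names used only for illustration.

```python
from collections import namedtuple

# Stand-in with the same fields as GenericDoc.
Doc = namedtuple("Doc", ["doc_id", "text"])


class InMemoryDocStore:
    """Illustrative stand-in exposing get(doc_id) like dataset.docs_store()."""

    def __init__(self, docs):
        self._docs = {d.doc_id: d for d in docs}

    def get(self, doc_id):
        return self._docs[doc_id]


def snippet(store, doc_id, width=40):
    """Return the first `width` characters of a document's text."""
    text = store.get(doc_id).text
    return text if len(text) <= width else text[:width].rstrip() + "..."


# Made-up documents; with ir_datasets installed, use
# store = ir_datasets.load("wikir/ens78k").docs_store() instead.
store = InMemoryDocStore([
    Doc("1", "short text"),
    Doc("2", "a considerably longer piece of article text"),
])
print(snippet(store, "1"), "|", snippet(store, "2", width=14))
```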

\cite{Frej2020Wikir,Frej2020MlWikir}

Bibtex:

@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}

{
  "docs": {
    "count": 2456637,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  }
}

`"wikir/ens78k/test"`

Test set of wikir/ens78k. Scoreddocs are the provided BM25 run.

7.9K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/ens78k/test queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/test')
index_ref = pt.IndexRef.of('./indices/wikir_ens78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

2.5M docs

Inherits docs from wikir/ens78k

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/ens78k/test docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/test')
# Index wikir/ens78k
indexer = pt.IterDictIndexer('./indices/wikir_ens78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

qrels

353K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Otherwise	`0`	0.0%
1	There is a link to the article with the query as its title in the first sentence	`345K`	97.8%
2	Query is the article title	`7.9K`	2.2%

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/ens78k/test qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.
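The TSV export can be read back with the standard library. The sketch below assumes the file has no header row and that the columns appear in the order shown above (`query_id`, `doc_id`, `relevance`, `iteration`), separated by tabs. `read_qrels_tsv` is a hypothetical helper, and the in-memory text stands in for a file produced by redirecting the command's output.

```python
import csv
import io
from collections import namedtuple

# Same fields as the TrecQrel namedtuple described above.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def read_qrels_tsv(handle):
    """Yield Qrel tuples from a tab-separated export, skipping blank lines."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue
        query_id, doc_id, relevance, iteration = row
        yield Qrel(query_id, doc_id, int(relevance), iteration)


# Made-up lines; for a real file use open("qrels.tsv", newline="").
exported = io.StringIO("1\t10\t2\t0\n1\t11\t1\t0\n")
parsed = list(read_qrels_tsv(exported))
print(parsed)
```

Converting `relevance` to `int` restores the type listed for the field; the identifiers stay as strings.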

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/test')
index_ref = pt.IndexRef.of('./indices/wikir_ens78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

786K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/ens78k/test scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Frej2020Wikir,Frej2020MlWikir}

Bibtex:

@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}

{
  "docs": {
    "count": 2456637,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 7862
  },
  "qrels": {
    "count": 353060,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 7862,
          "1": 345198
        }
      }
    }
  },
  "scoreddocs": {
    "count": 786100
  }
}

`"wikir/ens78k/training"`

Training set of wikir/ens78k. Scoreddocs are the provided BM25 run.

63K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/training")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/ens78k/training queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/training')
index_ref = pt.IndexRef.of('./indices/wikir_ens78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

2.5M docs

Inherits docs from wikir/ens78k

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/training")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/ens78k/training docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/training')
# Index wikir/ens78k
indexer = pt.IterDictIndexer('./indices/wikir_ens78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

qrels

2.4M qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Otherwise	`0`	0.0%
1	There is a link to the article with the query as its title in the first sentence	`2.4M`	97.4%
2	Query is the article title	`63K`	2.6%

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/training")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/ens78k/training qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/training')
index_ref = pt.IndexRef.of('./indices/wikir_ens78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

6.3M scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/training")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/ens78k/training scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Frej2020Wikir,Frej2020MlWikir}

Bibtex:

@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}

{
  "docs": {
    "count": 2456637,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 62904
  },
  "qrels": {
    "count": 2435257,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 62904,
          "1": 2372353
        }
      }
    }
  },
  "scoreddocs": {
    "count": 6289800
  }
}

`"wikir/ens78k/validation"`

Validation set of wikir/ens78k. Scoreddocs are the provided BM25 run.

7.9K queries

Language: en

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/validation")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/ens78k/validation queries



[query_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_ens78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())

You can find more details about PyTerrier retrieval here.

docs

2.5M docs

Inherits docs from wikir/ens78k

Language: en

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/validation")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/ens78k/validation docs



[doc_id]    [text]
...

You can find more details about the CLI here.

import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/validation')
# Index wikir/ens78k
indexer = pt.IterDictIndexer('./indices/wikir_ens78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])

You can find more details about PyTerrier indexing here.

qrels

272K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Otherwise	`0`	0.0%
1	There is a link to the article with the query as its title in the first sentence	`264K`	97.1%
2	Query is the article title	`7.9K`	2.9%

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/validation")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/ens78k/validation qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_ens78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.

786K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/validation")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/ens78k/validation scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Frej2020Wikir,Frej2020MlWikir}

Bibtex:

@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}

{
  "docs": {
    "count": 2456637,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 7862
  },
  "qrels": {
    "count": 271874,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 7862,
          "1": 264012
        }
      }
    }
  },
  "scoreddocs": {
    "count": 786100
  }
}

`"wikir/es13k"`

WikIR for Spanish.

docs

646K docs

Language: es

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/es13k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/es13k docs



[doc_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Frej2020Wikir,Frej2020MlWikir}

Bibtex:

@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}

{
  "docs": {
    "count": 645901,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  }
}

`"wikir/es13k/test"`

Test set of wikir/es13k. Scoreddocs are the provided BM25 run.

1.3K queries

Language: es

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/es13k/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/es13k/test queries



[query_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

646K docs

Inherits docs from wikir/es13k

Language: es

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/es13k/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/es13k/test docs



[doc_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

71K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Otherwise	`0`	0.0%
1	There is a link to the article with the query as its title in the first sentence	`70K`	98.2%
2	Query is the article title	`1.3K`	1.8%

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/es13k/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/es13k/test qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

130K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/es13k/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/es13k/test scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier
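Although no PyTerrier example is given for this split, the provided BM25 run can still be handed to external evaluation tools by writing it in the standard TREC run format (`query_id Q0 doc_id rank score tag`). The sketch below is one way to produce those lines. `to_trec_run` and the `tag` value are hypothetical, and the records are made up; in practice the input would be `dataset.scoreddocs_iter()`.

```python
from collections import defaultdict


def to_trec_run(scoreddocs, tag="bm25"):
    """Return TREC run lines, ranking each query's docs by descending score."""
    by_query = defaultdict(list)
    for query_id, doc_id, score in scoreddocs:
        by_query[query_id].append((score, doc_id))

    lines = []
    for query_id, pairs in by_query.items():
        pairs.sort(key=lambda p: p[0], reverse=True)
        for rank, (score, doc_id) in enumerate(pairs, start=1):
            lines.append(f"{query_id} Q0 {doc_id} {rank} {score} {tag}")
    return lines


# Illustrative records only (plain tuples unpack like GenericScoredDoc).
sample = [("q1", "d2", 1.5), ("q1", "d1", 3.0)]
run_lines = to_trec_run(sample)
print("\n".join(run_lines))
```

Ranks are derived from the scores here; they are not stored in the scoreddocs records themselves.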

\cite{Frej2020Wikir,Frej2020MlWikir}

Bibtex:

@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}

{
  "docs": {
    "count": 645901,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1300
  },
  "qrels": {
    "count": 71339,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 1300,
          "1": 70039
        }
      }
    }
  },
  "scoreddocs": {
    "count": 130000
  }
}

`"wikir/es13k/training"`

Training set of wikir/es13k. Scoreddocs are the provided BM25 run.

11K queries

Language: es

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/es13k/training")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/es13k/training queries



[query_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

646K docs

Inherits docs from wikir/es13k

Language: es

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/es13k/training")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/es13k/training docs



[doc_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

477K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Otherwise	`0`	0.0%
1	There is a link to the article with the query as its title in the first sentence	`466K`	97.7%
2	Query is the article title	`11K`	2.3%
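The count and percentage columns of the table above can be recomputed from any iterable of qrels. The sketch below tallies relevance levels with `collections.Counter`. The `Qrel` namedtuple mirrors the TrecQrel fields listed above, and the sample data is synthetic. The function name `relevance_distribution` is hypothetical.

```python
from collections import Counter, namedtuple

# Stand-in for the TrecQrel namedtuple described above, so the sketch runs offline.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def relevance_distribution(qrels):
    """Map each relevance level to (count, percentage of all qrels)."""
    counts = Counter(q.relevance for q in qrels)
    total = sum(counts.values())
    return {
        level: (n, round(100.0 * n / total, 1))
        for level, n in sorted(counts.items())
    }


# Tiny synthetic sample (not real WikIR data).
sample = [
    Qrel("q1", "d1", 2, "0"),
    Qrel("q1", "d2", 1, "0"),
    Qrel("q1", "d3", 1, "0"),
    Qrel("q2", "d4", 2, "0"),
]
distribution = relevance_distribution(sample)
```

Passing `dataset.qrels_iter()` instead of `sample` should yield the distribution for the full split. Level 0 does not appear in the result because, per the table, no level-0 judgments are stored.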

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/es13k/training")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/es13k/training qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

scoreddocs

1.1M scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/es13k/training")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/es13k/training scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier
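The scoreddocs form a flat stream of (query_id, doc_id, score) records, so they usually need grouping by query before they can be used as a ranking. The sketch below does that grouping and sorts each group by descending score. The `ScoredDoc` namedtuple mirrors the GenericScoredDoc fields above, the sample data is synthetic, and `rankings_from_scoreddocs` is a hypothetical helper name.

```python
from collections import defaultdict, namedtuple

# Stand-in for the GenericScoredDoc namedtuple described above.
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])


def rankings_from_scoreddocs(scoreddocs):
    """Group scored documents by query and order each group by descending score."""
    by_query = defaultdict(list)
    for sd in scoreddocs:
        by_query[sd.query_id].append((sd.doc_id, sd.score))
    return {
        qid: [doc_id for doc_id, _ in sorted(pairs, key=lambda p: p[1], reverse=True)]
        for qid, pairs in by_query.items()
    }


# Tiny synthetic sample (not real WikIR data).
sample = [
    ScoredDoc("q1", "d7", 3.2),
    ScoredDoc("q2", "d1", 5.0),
    ScoredDoc("q1", "d3", 9.1),
    ScoredDoc("q1", "d9", 0.4),
]
rankings = rankings_from_scoreddocs(sample)
```

On the real split, `rankings_from_scoreddocs(dataset.scoreddocs_iter())` would hold the whole run in memory. With about 1.1M records that is typically manageable, but this is an assumption about the available hardware.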

\cite{Frej2020Wikir,Frej2020MlWikir}

Bibtex:

@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}

{
  "docs": {
    "count": 645901,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 11202
  },
  "qrels": {
    "count": 477212,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 11202,
          "1": 466010
        }
      }
    }
  },
  "scoreddocs": {
    "count": 1120200
  }
}

`"wikir/es13k/validation"`

Validation set of wikir/es13k. Scoreddocs are the provided BM25 run.

queries

1.3K queries

Language: es

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/es13k/validation")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/es13k/validation queries



[query_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

646K docs

Inherits docs from wikir/es13k

Language: es

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/es13k/validation")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/es13k/validation docs



[doc_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

59K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Otherwise	`0`	0.0%
1	There is a link to the article with the query as its title in the first sentence	`57K`	97.8%
2	Query is the article title	`1.3K`	2.2%

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/es13k/validation")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/es13k/validation qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

scoreddocs

130K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/es13k/validation")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/es13k/validation scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Frej2020Wikir,Frej2020MlWikir}

Bibtex:

@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}

{
  "docs": {
    "count": 645901,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1300
  },
  "qrels": {
    "count": 58757,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 1300,
          "1": 57457
        }
      }
    }
  },
  "scoreddocs": {
    "count": 130000
  }
}

`"wikir/fr14k"`

WikIR for French.

docs

737K docs

Language: fr

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/fr14k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/fr14k docs



[doc_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Frej2020Wikir,Frej2020MlWikir}

Bibtex:

@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}

{
  "docs": {
    "count": 736616,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  }
}
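The `doc_id` statistics in the metadata block above can be approximated from a stream of documents. The sketch below assumes that `max_len` is the length in characters of the longest `doc_id` and that `common_prefix` is the prefix shared by all ids. How ir_datasets actually computes these fields is not stated here, so treat both as assumptions. The helper name `doc_id_stats` is hypothetical and the sample data is synthetic.

```python
import os
from collections import namedtuple

# Stand-in for the GenericDoc namedtuple described above.
Doc = namedtuple("Doc", ["doc_id", "text"])


def doc_id_stats(docs):
    """Compute count, longest doc_id length, and shared doc_id prefix in one pass."""
    count = 0
    max_len = 0
    prefix = None
    for doc in docs:
        count += 1
        max_len = max(max_len, len(doc.doc_id))
        # os.path.commonprefix compares strings character by character.
        prefix = doc.doc_id if prefix is None else os.path.commonprefix([prefix, doc.doc_id])
    return {
        "count": count,
        "fields": {"doc_id": {"max_len": max_len, "common_prefix": prefix or ""}},
    }


# Tiny synthetic sample (not real WikIR data).
sample = [Doc("12", "a"), Doc("1034", "b"), Doc("987654", "c")]
stats = doc_id_stats(sample)
```

The result has the same shape as the metadata block, which makes a direct comparison straightforward.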

`"wikir/fr14k/test"`

Test set of wikir/fr14k. Scoreddocs are the provided BM25 run.

queries

1.4K queries

Language: fr

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/fr14k/test queries



[query_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

737K docs

Inherits docs from wikir/fr14k

Language: fr

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/fr14k/test docs



[doc_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

56K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Otherwise	`0`	0.0%
1	There is a link to the article with the query as its title in the first sentence	`54K`	97.5%
2	Query is the article title	`1.4K`	2.5%
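The two non-zero levels in the table above mean different things: level 2 marks the article whose title is the query, while level 1 marks articles that link to it in their first sentence. The sketch below separates the two per query, which may be useful when only the title match should count as a hit. The `Qrel` namedtuple mirrors the TrecQrel fields above, the sample data is synthetic, and `split_by_level` is a hypothetical helper name.

```python
from collections import defaultdict, namedtuple

# Stand-in for the TrecQrel namedtuple described above.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def split_by_level(qrels):
    """Return (title_docs, linked_docs): level-2 and level-1 doc_ids per query."""
    title_docs = defaultdict(set)
    linked_docs = defaultdict(set)
    for q in qrels:
        if q.relevance == 2:
            title_docs[q.query_id].add(q.doc_id)
        elif q.relevance == 1:
            linked_docs[q.query_id].add(q.doc_id)
    return dict(title_docs), dict(linked_docs)


# Tiny synthetic sample (not real WikIR data).
sample = [
    Qrel("q1", "d1", 2, "0"),
    Qrel("q1", "d2", 1, "0"),
    Qrel("q1", "d3", 1, "0"),
    Qrel("q2", "d4", 2, "0"),
]
title_docs, linked_docs = split_by_level(sample)
```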

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/fr14k/test qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

scoreddocs

140K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/fr14k/test scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier
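Because the qrels are graded (levels 1 and 2) and a BM25 run is provided as scoreddocs, one natural check is to score that run with a graded metric. The sketch below computes nDCG@k for a single query using linear gains. This is an illustrative choice (other nDCG variants use exponential gains) and is not presented as the evaluation protocol of the WikIR authors. The function names and sample data are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gains; rank 1 has discount log2(2) = 1."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranking, judged, k=10):
    """nDCG@k for one query.

    ranking: doc_ids in ranked order.
    judged: {doc_id: relevance}; unjudged documents count as relevance 0.
    """
    gains = [judged.get(doc_id, 0) for doc_id in ranking[:k]]
    ideal = sorted(judged.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Tiny synthetic example (not real WikIR data).
judged = {"d1": 2, "d2": 1, "d3": 1}
perfect = ndcg_at_k(["d1", "d2", "d3", "d9"], judged, k=3)
reversed_order = ndcg_at_k(["d9", "d3", "d2", "d1"], judged, k=3)
```

A ranking that places the level-2 document first and the level-1 documents next scores 1.0. Pushing the level-2 document out of the top k lowers the score substantially.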

\cite{Frej2020Wikir,Frej2020MlWikir}

Bibtex:

@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}

{
  "docs": {
    "count": 736616,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1400
  },
  "qrels": {
    "count": 55647,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 1400,
          "1": 54247
        }
      }
    }
  },
  "scoreddocs": {
    "count": 140000
  }
}

`"wikir/fr14k/training"`

Training set of wikir/fr14k. Scoreddocs are the provided BM25 run.

queries

11K queries

Language: fr

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/training")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/fr14k/training queries



[query_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

737K docs

Inherits docs from wikir/fr14k

Language: fr

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/training")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/fr14k/training docs



[doc_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

609K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Otherwise	`0`	0.0%
1	There is a link to the article with the query as its title in the first sentence	`598K`	98.1%
2	Query is the article title	`11K`	1.9%

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/training")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/fr14k/training qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

scoreddocs

1.1M scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/training")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/fr14k/training scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier
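A training split with both qrels and a BM25 run is often used to build (query, positive, negative) triples for learning-to-rank models. The sketch below shows one hypothetical way to do that: positives come from the qrels, and negatives are sampled from documents in the run that have no positive judgment for the query. Nothing here is prescribed by the dataset itself. The function name, the sampling strategy, and the sample data are all assumptions for illustration.

```python
import random


def make_triples(judged, rankings, seed=0):
    """Yield (query_id, positive_doc_id, negative_doc_id) triples.

    judged: {query_id: {doc_id: relevance}}
    rankings: {query_id: [doc_id, ...]} taken from the scoreddocs
    A negative is a retrieved document with no positive judgment for that query.
    """
    rng = random.Random(seed)
    for qid, docs in judged.items():
        positives = [d for d, rel in docs.items() if rel > 0]
        negatives = [d for d in rankings.get(qid, []) if docs.get(d, 0) == 0]
        if not negatives:
            continue  # nothing to contrast against for this query
        for pos in positives:
            yield qid, pos, rng.choice(negatives)


# Tiny synthetic example (not real WikIR data).
judged = {"q1": {"d1": 2, "d2": 1}}
rankings = {"q1": ["d1", "d5", "d2", "d6"]}
triples = list(make_triples(judged, rankings))
```

Unjudged documents are treated as non-relevant here, which matches the "Otherwise" definition of level 0 in the table above but may still admit false negatives.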

\cite{Frej2020Wikir,Frej2020MlWikir}

Bibtex:

@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}

{
  "docs": {
    "count": 736616,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 11341
  },
  "qrels": {
    "count": 609240,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 11341,
          "1": 597899
        }
      }
    }
  },
  "scoreddocs": {
    "count": 1134100
  }
}

`"wikir/fr14k/validation"`

Validation set of wikir/fr14k. Scoreddocs are the provided BM25 run.

queries

1.4K queries

Language: fr

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/validation")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/fr14k/validation queries



[query_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

737K docs

Inherits docs from wikir/fr14k

Language: fr

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/validation")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/fr14k/validation docs



[doc_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

81K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Otherwise	`0`	0.0%
1	There is a link to the article with the query as its title in the first sentence	`80K`	98.3%
2	Query is the article title	`1.4K`	1.7%

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/validation")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/fr14k/validation qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

scoreddocs

140K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/validation")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/fr14k/validation scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Frej2020Wikir,Frej2020MlWikir}

Bibtex:

@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}

{
  "docs": {
    "count": 736616,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1400
  },
  "qrels": {
    "count": 81255,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 1400,
          "1": 79855
        }
      }
    }
  },
  "scoreddocs": {
    "count": 140000
  }
}

`"wikir/it16k"`

WikIR for Italian.

docs

503K docs

Language: it

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/it16k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/it16k docs



[doc_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Frej2020Wikir,Frej2020MlWikir}

Bibtex:

@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}

{
  "docs": {
    "count": 503012,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  }
}

`"wikir/it16k/test"`

Test set of wikir/it16k. Scoreddocs are the provided BM25 run.

queries

1.6K queries

Language: it

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/it16k/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/it16k/test queries



[query_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

503K docs

Inherits docs from wikir/it16k

Language: it

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/it16k/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/it16k/test docs



[doc_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

49K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Otherwise	`0`	0.0%
1	There is a link to the article with the query as its title in the first sentence	`48K`	96.8%
2	Query is the article title	`1.6K`	3.2%

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/it16k/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/it16k/test qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.
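Output saved from the export command above can be read back with the standard `csv` module. The sketch below assumes the rows are tab-separated in the column order shown (query_id, doc_id, relevance, iteration) and casts relevance to `int` to match the TrecQrel type. The helper name `read_qrels_tsv` is hypothetical and the sample rows are synthetic.

```python
import csv
import io
from collections import namedtuple

# Stand-in for the TrecQrel namedtuple described above.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def read_qrels_tsv(handle):
    """Parse tab-separated qrels rows into namedtuples, casting relevance to int."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        query_id, doc_id, relevance, iteration = row
        yield Qrel(query_id, doc_id, int(relevance), iteration)


# Synthetic rows in the column order shown above (not real WikIR data).
sample = io.StringIO("101\t101\t2\t0\n101\t2054\t1\t0\n")
rows = list(read_qrels_tsv(sample))
```

For a file on disk, the same function accepts an open file handle, for example `open("qrels.tsv", newline="")`, where `qrels.tsv` is a hypothetical file holding the redirected CLI output.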

No example available for PyTerrier

scoreddocs

160K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/it16k/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/it16k/test scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Frej2020Wikir,Frej2020MlWikir}

Bibtex:

@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}

{
  "docs": {
    "count": 503012,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1600
  },
  "qrels": {
    "count": 49338,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 1600,
          "1": 47738
        }
      }
    }
  },
  "scoreddocs": {
    "count": 160000
  }
}

`"wikir/it16k/training"`

Training set of wikir/it16k. Scoreddocs are the provided BM25 run.

queries

13K queries

Language: it

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/it16k/training")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/it16k/training queries



[query_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

503K docs

Inherits docs from wikir/it16k

Language: it

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/it16k/training")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/it16k/training docs



[doc_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

382K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Otherwise	`0`	0.0%
1	There is a link to the article with the query as its title in the first sentence	`369K`	96.5%
2	Query is the article title	`13K`	3.5%

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/it16k/training")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/it16k/training qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

scoreddocs

1.3M scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/it16k/training")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/it16k/training scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier
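When the scoreddocs are used as candidates for re-ranking, the share of judged-relevant documents that the run retrieved at all sets an upper bound on what a re-ranker can recover. The sketch below computes that recall per query from a qrels mapping and a rankings mapping. Both input shapes, the function name `recall_of_run`, and the sample data are hypothetical.

```python
def recall_of_run(judged, rankings):
    """Per-query fraction of relevant documents (relevance > 0) present in the run.

    judged: {query_id: {doc_id: relevance}}
    rankings: {query_id: [doc_id, ...]} taken from the scoreddocs
    Queries without any relevant document are skipped.
    """
    recalls = {}
    for qid, docs in judged.items():
        relevant = {d for d, rel in docs.items() if rel > 0}
        if not relevant:
            continue
        retrieved = set(rankings.get(qid, []))
        recalls[qid] = len(relevant & retrieved) / len(relevant)
    return recalls


# Tiny synthetic example (not real WikIR data).
judged = {"q1": {"d1": 2, "d2": 1, "d3": 1, "d4": 1}, "q2": {"d7": 2}}
rankings = {"q1": ["d1", "d9", "d3"]}
recalls = recall_of_run(judged, rankings)
```

A query that is missing from the run, like `q2` in the sample, gets a recall of 0.0 rather than being dropped.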

\cite{Frej2020Wikir,Frej2020MlWikir}

Bibtex:

@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}

{
  "docs": {
    "count": 503012,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 13418
  },
  "qrels": {
    "count": 381920,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 13418,
          "1": 368502
        }
      }
    }
  },
  "scoreddocs": {
    "count": 1341800
  }
}

`"wikir/it16k/validation"`

Validation set of wikir/it16k. Scoreddocs are the provided BM25 run.

queries

1.6K queries

Language: it

Query type:

GenericQuery: (namedtuple)

query_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/it16k/validation")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/it16k/validation queries



[query_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

docs

503K docs

Inherits docs from wikir/it16k

Language: it

Document type:

GenericDoc: (namedtuple)

doc_id: str
text: str

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/it16k/validation")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/it16k/validation docs



[doc_id]    [text]
...

You can find more details about the CLI here.

No example available for PyTerrier

qrels

45K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Otherwise	`0`	0.0%
1	There is a link to the article with the query as its title in the first sentence	`43K`	96.4%
2	Query is the article title	`1.6K`	3.6%

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/it16k/validation")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/it16k/validation qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

scoreddocs

160K scoreddocs

Scored Document type:

GenericScoredDoc: (namedtuple)

query_id: str
doc_id: str
score: float

Examples:

import ir_datasets
dataset = ir_datasets.load("wikir/it16k/validation")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>

You can find more details about the Python API here.

CLI

ir_datasets export wikir/it16k/validation scoreddocs --format tsv



[query_id]    [doc_id]    [score]
...

You can find more details about the CLI here.

No example available for PyTerrier

\cite{Frej2020Wikir,Frej2020MlWikir}

Bibtex:

@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}