ir_datasets: WikIR, a suite of IR benchmarks in multiple languages built from Wikipedia.
Bibtex:
@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}

A small version of WikIR for English.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
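Each doc is an ordinary Python namedtuple (ir_datasets calls this type GenericDoc), so fields can be read by name or converted to a dict. A minimal sketch using a stand-in tuple (real instances come from docs_iter(); the ID and text here are made up):

```python
from collections import namedtuple

# Stand-in for ir_datasets' GenericDoc; real instances come from docs_iter().
GenericDoc = namedtuple("GenericDoc", ["doc_id", "text"])
doc = GenericDoc("12345", "Example passage text.")

print(doc.doc_id)     # field access by name
print(doc._asdict())  # convert to a dict for e.g. JSON serialization
```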
ir_datasets export wikir/en1k docs
[doc_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k')
# Index wikir/en1k
indexer = pt.IterDictIndexer('./indices/wikir_en1k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en1k')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
  "docs": {
    "count": 369721,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  }
}
Test set of wikir/en1k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/test queries
[query_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/test')
index_ref = pt.IndexRef.of('./indices/wikir_en1k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.en1k.test.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/en1k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/test docs
[doc_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/test')
# Index wikir/en1k
indexer = pt.IterDictIndexer('./indices/wikir_en1k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en1k.test')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% | 
| 1 | There is a link to the article with the query as its title in the first sentence | 4.3K | 97.7% | 
| 2 | Query is the article title | 100 | 2.3% | 
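The percentage column follows directly from the relevance counts reported in the qrels statistics for this split (4,335 judgments at level 1 and 100 at level 2). A quick arithmetic check:

```python
# Relevance counts for wikir/en1k/test, from the qrels counts_by_value stats.
counts = {1: 4335, 2: 100}
total = sum(counts.values())
pcts = {rel: round(100 * n / total, 1) for rel, n in counts.items()}
print(total, pcts)  # 4435 {1: 97.7, 2: 2.3}
```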
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/test qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
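The exported qrels lines are tab-separated, so downstream scripts can split on `\t` to recover the four fields. A minimal parsing sketch (the IDs here are made up):

```python
# One hypothetical line from the TSV export above:
# query_id, doc_id, relevance, iteration, separated by tabs.
line = "123\t45678\t2\t0\n"
query_id, doc_id, relevance, iteration = line.rstrip("\n").split("\t")
print(query_id, doc_id, int(relevance), iteration)
```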
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/test')
index_ref = pt.IndexRef.of('./indices/wikir_en1k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.en1k.test.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/test scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/test')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro  # assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.en1k.test.scoreddocs') # AdhocRun
# A run is a generic object, specialized into concrete classes
# such as TrecAdhocRun.
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
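If you want the provided BM25 run as a standard TREC run file for external tools such as trec_eval, each (query_id, doc_id, score) triple maps onto one line. A minimal sketch with stand-in tuples; real ones come from scoreddocs_iter(), the run tag "bm25" is an arbitrary choice, and it assumes the scoreddocs are already sorted by descending score within each query:

```python
from collections import namedtuple

# Stand-in scoreddocs; real ones come from dataset.scoreddocs_iter().
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])
scoreddocs = [
    ScoredDoc("1", "77", 12.5),
    ScoredDoc("1", "42", 11.2),
]

# TREC run format: query_id Q0 doc_id rank score run_tag
lines = [
    f"{s.query_id} Q0 {s.doc_id} {rank} {s.score} bm25"
    for rank, s in enumerate(scoreddocs, start=1)
]
print("\n".join(lines))
```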
{
  "docs": {
    "count": 369721,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 100
  },
  "qrels": {
    "count": 4435,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 100,
          "1": 4335
        }
      }
    }
  },
  "scoreddocs": {
    "count": 10000
  }
}
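One pattern worth noting in these statistics: across the wikir/en1k splits, the provided BM25 run contains exactly 100 scored documents per query. A quick check using the query and scoreddoc counts reported on this page:

```python
# (queries, scoreddocs) per wikir/en1k split, from the stats on this page.
splits = {
    "test": (100, 10_000),
    "training": (1_444, 144_400),
    "validation": (100, 10_000),
}
per_query = {name: sd // q for name, (q, sd) in splits.items()}
print(per_query)  # 100 scored documents per query in every split
```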
Training set of wikir/en1k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/training")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/training queries
[query_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/training')
index_ref = pt.IndexRef.of('./indices/wikir_en1k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.en1k.training.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/en1k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/training")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/training docs
[doc_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/training')
# Index wikir/en1k
indexer = pt.IterDictIndexer('./indices/wikir_en1k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en1k.training')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% | 
| 1 | There is a link to the article with the query as its title in the first sentence | 46K | 97.0% | 
| 2 | Query is the article title | 1.4K | 3.0% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/training")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/training qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/training')
index_ref = pt.IndexRef.of('./indices/wikir_en1k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.en1k.training.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/training")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/training scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/training')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro  # assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.en1k.training.scoreddocs') # AdhocRun
# A run is a generic object, specialized into concrete classes
# such as TrecAdhocRun.
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
{
  "docs": {
    "count": 369721,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1444
  },
  "qrels": {
    "count": 47699,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 1444,
          "1": 46255
        }
      }
    }
  },
  "scoreddocs": {
    "count": 144400
  }
}
Validation set of wikir/en1k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/validation")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/validation queries
[query_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_en1k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.en1k.validation.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/en1k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/validation")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/validation docs
[doc_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/validation')
# Index wikir/en1k
indexer = pt.IterDictIndexer('./indices/wikir_en1k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en1k.validation')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% | 
| 1 | There is a link to the article with the query as its title in the first sentence | 4.9K | 98.0% | 
| 2 | Query is the article title | 100 | 2.0% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/validation")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/validation qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_en1k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.en1k.validation.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/validation")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/validation scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/validation')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro  # assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.en1k.validation.scoreddocs') # AdhocRun
# A run is a generic object, specialized into concrete classes
# such as TrecAdhocRun.
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
{
  "docs": {
    "count": 369721,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 100
  },
  "qrels": {
    "count": 4979,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 100,
          "1": 4879
        }
      }
    }
  },
  "scoreddocs": {
    "count": 10000
  }
}
WikIR for English.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en59k docs
[doc_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k')
# Index wikir/en59k
indexer = pt.IterDictIndexer('./indices/wikir_en59k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en59k')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
  "docs": {
    "count": 2454785,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  }
}
Test set of wikir/en59k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/test queries
[query_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/test')
index_ref = pt.IndexRef.of('./indices/wikir_en59k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.en59k.test.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/en59k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/test docs
[doc_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/test')
# Index wikir/en59k
indexer = pt.IterDictIndexer('./indices/wikir_en59k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en59k.test')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% | 
| 1 | There is a link to the article with the query as its title in the first sentence | 104K | 99.0% | 
| 2 | Query is the article title | 1.0K | 1.0% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/test qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/test')
index_ref = pt.IndexRef.of('./indices/wikir_en59k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.en59k.test.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/test scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/test')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro  # assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.en59k.test.scoreddocs') # AdhocRun
# A run is a generic object, specialized into concrete classes
# such as TrecAdhocRun.
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
{
  "docs": {
    "count": 2454785,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1000
  },
  "qrels": {
    "count": 104715,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 1000,
          "1": 103715
        }
      }
    }
  },
  "scoreddocs": {
    "count": 100000
  }
}
Training set of wikir/en59k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/training")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/training queries
[query_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/training')
index_ref = pt.IndexRef.of('./indices/wikir_en59k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.en59k.training.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/en59k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/training")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/training docs
[doc_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/training')
# Index wikir/en59k
indexer = pt.IterDictIndexer('./indices/wikir_en59k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en59k.training')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% | 
| 1 | There is a link to the article with the query as its title in the first sentence | 2.4M | 97.7% | 
| 2 | Query is the article title | 57K | 2.3% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/training")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/training qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/training')
index_ref = pt.IndexRef.of('./indices/wikir_en59k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.en59k.training.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/training")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/training scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/training')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro  # assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.en59k.training.scoreddocs') # AdhocRun
# A run is a generic object, specialized into concrete classes
# such as TrecAdhocRun.
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
{
  "docs": {
    "count": 2454785,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 57251
  },
  "qrels": {
    "count": 2443383,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 57251,
          "1": 2386132
        }
      }
    }
  },
  "scoreddocs": {
    "count": 5725100
  }
}
Validation set of wikir/en59k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/validation")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/validation queries
[query_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_en59k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.en59k.validation.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/en59k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/validation")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/validation docs
[doc_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/validation')
# Index wikir/en59k
indexer = pt.IterDictIndexer('./indices/wikir_en59k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en59k.validation')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% | 
| 1 | There is a link to the article with the query as its title in the first sentence | 68K | 98.5% | 
| 2 | Query is the article title | 1.0K | 1.5% | 
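The counts in the table above can be recomputed by tallying qrels by relevance level; a minimal sketch, shown here with synthetic qrels rather than the full dataset download:

```python
from collections import Counter, namedtuple

# Same shape as the namedtuples yielded by dataset.qrels_iter()
Qrel = namedtuple('Qrel', ['query_id', 'doc_id', 'relevance', 'iteration'])

def relevance_histogram(qrels):
    """Tally qrels by relevance level (cf. the table above)."""
    return Counter(q.relevance for q in qrels)

# Synthetic qrels for illustration; pass dataset.qrels_iter() for the real counts.
qrels = [Qrel('q1', 'd1', 2, '0'), Qrel('q1', 'd2', 1, '0'), Qrel('q2', 'd3', 1, '0')]
hist = relevance_histogram(qrels)  # Counter({1: 2, 2: 1})
```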
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/validation")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/validation qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_en59k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
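The nDCG@20 measure passed to pt.Experiment above can be sketched per query in a few lines (a simplified pure-Python reimplementation, not PyTerrier's own code):

```python
import math

def ndcg_at_k(ranking, qrels, k=20):
    """nDCG@k for one query: `ranking` is a list of doc_ids in retrieved
    order, `qrels` maps doc_id -> graded relevance."""
    gains = [qrels.get(d, 0) for d in ranking[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

A ranking that places the grade-2 document first scores 1.0; swapping it below a grade-1 document lowers the score.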
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.en59k.validation.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/validation")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/validation scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/validation')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro  # requires that experimaestro-ir be installed
run = datamaestro.prepare_dataset('irds.wikir.en59k.validation.scoreddocs')  # AdhocRun
# A run is a generic object; it is specialized into concrete classes,
# e.g. TrecAdhocRun.
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
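The scoreddocs (the provided BM25 run) can also be written out in the standard TREC run format for use with tools such as trec_eval; a minimal sketch, assuming triples shaped like those yielded by scoreddocs_iter():

```python
from collections import defaultdict

def to_trec_run(scoreddocs, run_name='bm25'):
    """Format (query_id, doc_id, score) triples as TREC run lines:
    'qid Q0 docid rank score tag', ranked per query by descending score."""
    by_query = defaultdict(list)
    for query_id, doc_id, score in scoreddocs:
        by_query[query_id].append((doc_id, score))
    lines = []
    for qid, docs in by_query.items():
        ranked = sorted(docs, key=lambda pair: -pair[1])
        for rank, (doc_id, score) in enumerate(ranked, start=1):
            lines.append(f'{qid} Q0 {doc_id} {rank} {score} {run_name}')
    return lines
```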
Bibtex:
@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}
{
  "docs": {
    "count": 2454785,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1000
  },
  "qrels": {
    "count": 68905,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 1000,
          "1": 67905
        }
      }
    }
  },
  "scoreddocs": {
    "count": 100000
  }
}
WikIR for English. This is one of the two versions used in Frej2020Wikir.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en78k docs
[doc_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k')
# Index wikir/en78k
indexer = pt.IterDictIndexer('./indices/wikir_en78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en78k')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Bibtex:
@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}
{
  "docs": {
    "count": 2456637,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  }
}
Test set of wikir/en78k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/test queries
[query_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/test')
index_ref = pt.IndexRef.of('./indices/wikir_en78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.en78k.test.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/en78k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/test docs
[doc_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/test')
# Index wikir/en78k
indexer = pt.IterDictIndexer('./indices/wikir_en78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en78k.test')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% | 
| 1 | There is a link to the article with the query as its title in the first sentence | 345K | 97.8% | 
| 2 | Query is the article title | 7.9K | 2.2% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/test qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/test')
index_ref = pt.IndexRef.of('./indices/wikir_en78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.en78k.test.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/test scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/test')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro  # requires that experimaestro-ir be installed
run = datamaestro.prepare_dataset('irds.wikir.en78k.test.scoreddocs')  # AdhocRun
# A run is a generic object; it is specialized into concrete classes,
# e.g. TrecAdhocRun.
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
Bibtex:
@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}
{
  "docs": {
    "count": 2456637,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 7862
  },
  "qrels": {
    "count": 353060,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 7862,
          "1": 345198
        }
      }
    }
  },
  "scoreddocs": {
    "count": 785600
  }
}
Training set of wikir/en78k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/training")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/training queries
[query_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/training')
index_ref = pt.IndexRef.of('./indices/wikir_en78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.en78k.training.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/en78k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/training")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/training docs
[doc_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/training')
# Index wikir/en78k
indexer = pt.IterDictIndexer('./indices/wikir_en78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en78k.training')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% | 
| 1 | There is a link to the article with the query as its title in the first sentence | 2.4M | 97.4% | 
| 2 | Query is the article title | 63K | 2.6% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/training")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/training qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/training')
index_ref = pt.IndexRef.of('./indices/wikir_en78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.en78k.training.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/training")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/training scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/training')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro  # requires that experimaestro-ir be installed
run = datamaestro.prepare_dataset('irds.wikir.en78k.training.scoreddocs')  # AdhocRun
# A run is a generic object; it is specialized into concrete classes,
# e.g. TrecAdhocRun.
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
Bibtex:
@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}
{
  "docs": {
    "count": 2456637,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 62904
  },
  "qrels": {
    "count": 2435257,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 62904,
          "1": 2372353
        }
      }
    }
  },
  "scoreddocs": {
    "count": 6284800
  }
}
Validation set of wikir/en78k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/validation")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/validation queries
[query_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_en78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.en78k.validation.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/en78k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/validation")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/validation docs
[doc_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/validation')
# Index wikir/en78k
indexer = pt.IterDictIndexer('./indices/wikir_en78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en78k.validation')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% | 
| 1 | There is a link to the article with the query as its title in the first sentence | 264K | 97.1% | 
| 2 | Query is the article title | 7.9K | 2.9% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/validation")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/validation qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_en78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.en78k.validation.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/validation")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/validation scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/validation')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro  # requires that experimaestro-ir be installed
run = datamaestro.prepare_dataset('irds.wikir.en78k.validation.scoreddocs')  # AdhocRun
# A run is a generic object; it is specialized into concrete classes,
# e.g. TrecAdhocRun.
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
Bibtex:
@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}
{
  "docs": {
    "count": 2456637,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 7862
  },
  "qrels": {
    "count": 271874,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 7862,
          "1": 264012
        }
      }
    }
  },
  "scoreddocs": {
    "count": 785700
  }
}
WikIR for English, using the first sentences of articles as queries. This is one of the two versions used in Frej2020Wikir.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k docs
[doc_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k')
# Index wikir/ens78k
indexer = pt.IterDictIndexer('./indices/wikir_ens78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.ens78k')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Bibtex:
@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}
{
  "docs": {
    "count": 2456637,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  }
}
Test set of wikir/ens78k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/test queries
[query_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/test')
index_ref = pt.IndexRef.of('./indices/wikir_ens78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.ens78k.test.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/ens78k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/test docs
[doc_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/test')
# Index wikir/ens78k
indexer = pt.IterDictIndexer('./indices/wikir_ens78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.ens78k.test')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% | 
| 1 | There is a link to the article with the query as its title in the first sentence | 345K | 97.8% | 
| 2 | Query is the article title | 7.9K | 2.2% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/test qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/test')
index_ref = pt.IndexRef.of('./indices/wikir_ens78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.ens78k.test.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/test scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/test')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro  # requires that experimaestro-ir be installed
run = datamaestro.prepare_dataset('irds.wikir.ens78k.test.scoreddocs')  # AdhocRun
# A run is a generic object; it is specialized into concrete classes,
# e.g. TrecAdhocRun.
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
Bibtex:
@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}
{
  "docs": {
    "count": 2456637,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 7862
  },
  "qrels": {
    "count": 353060,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 7862,
          "1": 345198
        }
      }
    }
  },
  "scoreddocs": {
    "count": 786100
  }
}
Training set of wikir/ens78k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/training")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/training queries
[query_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/training')
index_ref = pt.IndexRef.of('./indices/wikir_ens78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.ens78k.training.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/ens78k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/training")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/training docs
[doc_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/training')
# Index wikir/ens78k
indexer = pt.IterDictIndexer('./indices/wikir_ens78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.ens78k.training')
for doc in dataset.iter_documents():
    print(doc)  # a document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% | 
| 1 | There is a link to the article with the query as its title in the first sentence | 2.4M | 97.4% | 
| 2 | Query is the article title | 63K | 2.6% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/training")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/training qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/training')
index_ref = pt.IndexRef.of('./indices/wikir_ens78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.ens78k.training.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/training")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/training scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...
You can find more details about the CLI here.
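The exported scoreddocs can likewise be grouped back into a run structure (one descending-score ranking per query) with a few lines of standard Python; the sample rows below are illustrative:

```python
import csv
import io
from collections import defaultdict

# Group exported scoreddocs (query_id, doc_id, score) into a run:
# a descending-score ranking per query.
sample = "q1\td2\t7.5\nq1\td9\t9.1\nq2\td4\t3.0\n"
run = defaultdict(list)
for query_id, doc_id, score in csv.reader(io.StringIO(sample), delimiter="\t"):
    run[query_id].append((doc_id, float(score)))
for ranking in run.values():
    ranking.sort(key=lambda pair: pair[1], reverse=True)
# run["q1"] == [("d9", 9.1), ("d2", 7.5)]
```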
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/training')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro  # requires experimaestro-ir to be installed
run = datamaestro.prepare_dataset('irds.wikir.ens78k.training.scoreddocs')  # AdhocRun
# A run is a generic object, specialized into concrete classes
# such as TrecAdhocRun
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocRun.
Bibtex:
@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}
{
  "docs": {
    "count": 2456637,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 62904
  },
  "qrels": {
    "count": 2435257,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 62904,
          "1": 2372353
        }
      }
    }
  },
  "scoreddocs": {
    "count": 6289800
  }
}
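The relevance percentages in the wikir/ens78k/training table can be recomputed directly from `counts_by_value` in these stats:

```python
# Recompute the relevance-level percentages from the qrels stats above
# (wikir/ens78k/training: 62,904 level-2 and 2,372,353 level-1 judgments).
counts_by_value = {2: 62904, 1: 2372353}
total = sum(counts_by_value.values())
percentages = {level: round(100 * n / total, 1) for level, n in counts_by_value.items()}
# percentages == {2: 2.6, 1: 97.4}, matching the relevance-levels table
```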
Validation set of wikir/ens78k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/validation")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/validation queries
[query_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_ens78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.ens78k.validation.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/ens78k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/validation")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/validation docs
[doc_id]    [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/validation')
# Index wikir/ens78k
indexer = pt.IterDictIndexer('./indices/wikir_ens78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.ens78k.validation')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% | 
| 1 | There is a link to the article with the query as its title in the first sentence | 264K | 97.1% | 
| 2 | Query is the article title | 7.9K | 2.9% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/validation")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/validation qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_ens78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.ens78k.validation.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/validation")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/validation scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/validation')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro  # requires experimaestro-ir to be installed
run = datamaestro.prepare_dataset('irds.wikir.ens78k.validation.scoreddocs')  # AdhocRun
# A run is a generic object, specialized into concrete classes
# such as TrecAdhocRun
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocRun.
Bibtex:
@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}
{
  "docs": {
    "count": 2456637,
    "fields": {
      "doc_id": {
        "max_len": 7,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 7862
  },
  "qrels": {
    "count": 271874,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 7862,
          "1": 264012
        }
      }
    }
  },
  "scoreddocs": {
    "count": 786100
  }
}
WikIR for Spanish.
Language: es
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/es13k docs
[doc_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.es13k')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Bibtex:
@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}
{
  "docs": {
    "count": 645901,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  }
}
Test set of wikir/es13k. Scoreddocs are the provided BM25 run.
Language: es
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/test queries
[query_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.es13k.test.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/es13k
Language: es
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/test docs
[doc_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.es13k.test')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% | 
| 1 | There is a link to the article with the query as its title in the first sentence | 70K | 98.2% | 
| 2 | Query is the article title | 1.3K | 1.8% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/test qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.es13k.test.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/test scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
import datamaestro  # requires experimaestro-ir to be installed
run = datamaestro.prepare_dataset('irds.wikir.es13k.test.scoreddocs')  # AdhocRun
# A run is a generic object, specialized into concrete classes
# such as TrecAdhocRun
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocRun.
Bibtex:
@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}
{
  "docs": {
    "count": 645901,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1300
  },
  "qrels": {
    "count": 71339,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 1300,
          "1": 70039
        }
      }
    }
  },
  "scoreddocs": {
    "count": 130000
  }
}
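A quick consistency check on the wikir/es13k/test stats: the scoreddocs count works out to exactly 100 per query, which suggests the provided BM25 run keeps the top 100 documents for each query (the depth is inferred from the counts here, not stated by the dataset):

```python
# Sanity-check the wikir/es13k/test stats: scored documents per query.
queries = 1300        # from "queries.count" above
scoreddocs = 130000   # from "scoreddocs.count" above
per_query = scoreddocs // queries
# per_query == 100, i.e. the run divides evenly into 100 docs per query
```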
Training set of wikir/es13k. Scoreddocs are the provided BM25 run.
Language: es
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/training")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/training queries
[query_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.es13k.training.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/es13k
Language: es
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/training")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/training docs
[doc_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.es13k.training')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% | 
| 1 | There is a link to the article with the query as its title in the first sentence | 466K | 97.7% | 
| 2 | Query is the article title | 11K | 2.3% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/training")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/training qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.es13k.training.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/training")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/training scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
import datamaestro  # requires experimaestro-ir to be installed
run = datamaestro.prepare_dataset('irds.wikir.es13k.training.scoreddocs')  # AdhocRun
# A run is a generic object, specialized into concrete classes
# such as TrecAdhocRun
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocRun.
Bibtex:
@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}
{
  "docs": {
    "count": 645901,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 11202
  },
  "qrels": {
    "count": 477212,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 11202,
          "1": 466010
        }
      }
    }
  },
  "scoreddocs": {
    "count": 1120200
  }
}
Validation set of wikir/es13k. Scoreddocs are the provided BM25 run.
Language: es
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/validation")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/validation queries
[query_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.es13k.validation.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/es13k
Language: es
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/validation")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/validation docs
[doc_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.es13k.validation')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% | 
| 1 | There is a link to the article with the query as its title in the first sentence | 57K | 97.8% | 
| 2 | Query is the article title | 1.3K | 2.2% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/validation")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/validation qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.es13k.validation.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/validation")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/validation scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
import datamaestro  # requires experimaestro-ir to be installed
run = datamaestro.prepare_dataset('irds.wikir.es13k.validation.scoreddocs')  # AdhocRun
# A run is a generic object, specialized into concrete classes
# such as TrecAdhocRun
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocRun.
Bibtex:
@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}
{
  "docs": {
    "count": 645901,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1300
  },
  "qrels": {
    "count": 58757,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 1300,
          "1": 57457
        }
      }
    }
  },
  "scoreddocs": {
    "count": 130000
  }
}
WikIR for French.
Language: fr
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k docs
[doc_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.fr14k')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Bibtex:
@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}
{
  "docs": {
    "count": 736616,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  }
}
Test set of wikir/fr14k. Scoreddocs are the provided BM25 run.
Language: fr
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/test queries
[query_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.fr14k.test.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/fr14k
Language: fr
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/test docs
[doc_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.fr14k.test')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% | 
| 1 | There is a link to the article with the query as its title in the first sentence | 54K | 97.5% | 
| 2 | Query is the article title | 1.4K | 2.5% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/test qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.fr14k.test.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/test scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
import datamaestro  # requires experimaestro-ir to be installed
run = datamaestro.prepare_dataset('irds.wikir.fr14k.test.scoreddocs')  # AdhocRun
# A run is a generic object, specialized into concrete classes
# such as TrecAdhocRun
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocRun.
Bibtex:
@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}
{
  "docs": {
    "count": 736616,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1400
  },
  "qrels": {
    "count": 55647,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 1400,
          "1": 54247
        }
      }
    }
  },
  "scoreddocs": {
    "count": 140000
  }
}
Training set of wikir/fr14k. Scoreddocs are the provided BM25 run.
Language: fr
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/training")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/training queries
[query_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.fr14k.training.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/fr14k
Language: fr
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/training")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/training docs
[doc_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.fr14k.training')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% | 
| 1 | There is a link to the article with the query as its title in the first sentence | 598K | 98.1% | 
| 2 | Query is the article title | 11K | 1.9% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/training")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/training qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.fr14k.training.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/training")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/training scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
import datamaestro  # requires experimaestro-ir to be installed
run = datamaestro.prepare_dataset('irds.wikir.fr14k.training.scoreddocs')  # AdhocRun
# A run is a generic object, specialized into concrete classes
# such as TrecAdhocRun
This example requires that experimaestro-ir is installed. For more information about the returned object, see the documentation about AdhocRun.
Bibtex:
@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}
{
  "docs": {
    "count": 736616,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 11341
  },
  "qrels": {
    "count": 609240,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 11341,
          "1": 597899
        }
      }
    }
  },
  "scoreddocs": {
    "count": 1134100
  }
}
Validation set of wikir/fr14k. Scoreddocs are the provided BM25 run.
Language: fr
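The scoreddocs here are a provided BM25 run. The scoring function itself can be sketched as follows: a simplified single-field BM25 with the usual k1/b defaults, run on toy tokenized documents. This is not the toolkit's actual implementation, just an illustration of the ranking formula behind the run:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Simplified BM25 over pre-tokenized documents."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # smoothed IDF
        freq = tf[term]
        norm = freq * (k1 + 1) / (
            freq + k1 * (1 - b + b * len(doc_terms) / avgdl)
        )
        score += idf * norm
    return score

# Toy corpus: the document matching both query terms should outrank
# the one matching only one of them
corpus = [["paris", "france", "capital"], ["rome", "italy"], ["france", "wine"]]
s_both = bm25_score(["france", "capital"], corpus[0], corpus)
s_one = bm25_score(["france", "capital"], corpus[2], corpus)
```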
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/validation")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/validation queries
[query_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.fr14k.validation.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/fr14k
Language: fr
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/validation")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/validation docs
[doc_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.fr14k.validation')  # an AdhocDocumentStore
for doc in dataset.iter_documents():
    print(doc)
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% | 
| 1 | There is a link to the article with the query as its title in the first sentence | 80K | 98.3% | 
| 2 | Query is the article title | 1.4K | 1.7% | 
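The counts and percentages in the table above can be reproduced by tallying the relevance field of the qrels. A sketch on synthetic tuples; the real ones would come from `dataset.qrels_iter()`:

```python
from collections import Counter

# Hypothetical qrels in the (query_id, doc_id, relevance, iteration) layout
qrels = [
    ("q1", "d1", 2, "0"),
    ("q1", "d2", 1, "0"),
    ("q1", "d3", 1, "0"),
    ("q2", "d4", 2, "0"),
]

counts = Counter(rel for _, _, rel, _ in qrels)
total = sum(counts.values())
percentages = {rel: 100 * n / total for rel, n in counts.items()}
```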
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/validation")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/validation qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.fr14k.validation.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/validation")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/validation scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
import datamaestro  # Assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.fr14k.validation.scoreddocs')  # AdhocRun
# A run is a generic object, specialized into final classes
# such as TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
Bibtex:
@inproceedings{Frej2020Wikir, title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={LREC}, year={2020}}
@inproceedings{Frej2020MlWikir, title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={CIRCLE}, year={2020}}
{
  "docs": {
    "count": 736616,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1400
  },
  "qrels": {
    "count": 81255,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 1400,
          "1": 79855
        }
      }
    }
  },
  "scoreddocs": {
    "count": 140000
  }
}
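Several figures in the stats block above are internally consistent and easy to check: the scoreddocs count divides evenly into 100 per query (the per-query retrieval depth is inferred from the ratio, not stated by the dataset), the per-value qrel counts sum to the total, and the number of relevance-2 judgments equals the number of queries, since relevance 2 means the query is the article title:

```python
# Figures copied from the wikir/fr14k/validation stats block
queries = 1400
scoreddocs = 140_000
qrels_total = 81_255
qrels_by_value = {2: 1_400, 1: 79_855}

assert scoreddocs == queries * 100              # 100 scored docs per query
assert sum(qrels_by_value.values()) == qrels_total
assert qrels_by_value[2] == queries             # one title match per query
```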
WikIR for Italian.
Language: it
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/it16k docs
[doc_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.it16k')  # an AdhocDocumentStore
for doc in dataset.iter_documents():
    print(doc)
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Bibtex:
@inproceedings{Frej2020Wikir, title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={LREC}, year={2020}}
@inproceedings{Frej2020MlWikir, title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={CIRCLE}, year={2020}}
{
  "docs": {
    "count": 503012,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  }
}
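The `doc_id` field statistics above (`max_len`, `common_prefix`) can be derived from the identifiers themselves. A sketch on made-up ids; the real ones would come from `dataset.docs_iter()`:

```python
import os

# Hypothetical numeric doc_ids of varying length
doc_ids = ["12345", "987", "501234"]

max_len = max(len(d) for d in doc_ids)            # longest identifier
common_prefix = os.path.commonprefix(doc_ids)     # shared leading characters
```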
Test set of wikir/it16k. Scoreddocs are the provided BM25 run.
Language: it
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/test queries
[query_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.it16k.test.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/it16k
Language: it
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/test docs
[doc_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.it16k.test')  # an AdhocDocumentStore
for doc in dataset.iter_documents():
    print(doc)
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% | 
| 1 | There is a link to the article with the query as its title in the first sentence | 48K | 96.8% | 
| 2 | Query is the article title | 1.6K | 3.2% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/test qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.it16k.test.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/test scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
import datamaestro  # Assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.it16k.test.scoreddocs')  # AdhocRun
# A run is a generic object, specialized into final classes
# such as TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
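The scoreddocs (the BM25 run) and the qrels are typically combined to evaluate retrieval quality. A minimal precision-at-k sketch on synthetic tuples, not the toolkit's evaluation code; in practice the run would come from `dataset.scoreddocs_iter()` and the judgments from `dataset.qrels_iter()`:

```python
from collections import defaultdict

# Hypothetical run (query_id, doc_id, score) and graded qrels
run = [("q1", "d1", 9.0), ("q1", "d2", 8.0), ("q1", "d3", 7.0)]
qrels = {("q1", "d1"): 2, ("q1", "d3"): 1}

def precision_at_k(run, qrels, k=10):
    """Fraction of the top-k retrieved docs that are judged relevant (>0)."""
    by_query = defaultdict(list)
    for qid, did, score in run:
        by_query[qid].append((score, did))
    scores = {}
    for qid, docs in by_query.items():
        top = [did for _, did in sorted(docs, reverse=True)[:k]]
        relevant = sum(1 for did in top if qrels.get((qid, did), 0) > 0)
        scores[qid] = relevant / k
    return scores

print(precision_at_k(run, qrels, k=10))  # {'q1': 0.2}
```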
Bibtex:
@inproceedings{Frej2020Wikir, title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={LREC}, year={2020}}
@inproceedings{Frej2020MlWikir, title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={CIRCLE}, year={2020}}
{
  "docs": {
    "count": 503012,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1600
  },
  "qrels": {
    "count": 49338,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 1600,
          "1": 47738
        }
      }
    }
  },
  "scoreddocs": {
    "count": 160000
  }
}
Training set of wikir/it16k. Scoreddocs are the provided BM25 run.
Language: it
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/training")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/training queries
[query_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.it16k.training.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/it16k
Language: it
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/training")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/training docs
[doc_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.it16k.training')  # an AdhocDocumentStore
for doc in dataset.iter_documents():
    print(doc)
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% | 
| 1 | There is a link to the article with the query as its title in the first sentence | 369K | 96.5% | 
| 2 | Query is the article title | 13K | 3.5% | 
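For training a ranker on a set like this, the graded qrels above are typically grouped per query into a lookup of judged documents. A sketch on synthetic tuples; real ones would come from `dataset.qrels_iter()`:

```python
from collections import defaultdict

# Hypothetical (query_id, doc_id, relevance, iteration) tuples
qrels = [
    ("q1", "d1", 2, "0"),
    ("q1", "d2", 1, "0"),
    ("q2", "d3", 2, "0"),
]

relevant = defaultdict(dict)   # query_id -> {doc_id: relevance}
for qid, did, rel, _ in qrels:
    relevant[qid][did] = rel
```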
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/training")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/training qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.it16k.training.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/training")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/training scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
import datamaestro  # Assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.it16k.training.scoreddocs')  # AdhocRun
# A run is a generic object, specialized into final classes
# such as TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
Bibtex:
@inproceedings{Frej2020Wikir, title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={LREC}, year={2020}}
@inproceedings{Frej2020MlWikir, title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={CIRCLE}, year={2020}}
{
  "docs": {
    "count": 503012,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 13418
  },
  "qrels": {
    "count": 381920,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 13418,
          "1": 368502
        }
      }
    }
  },
  "scoreddocs": {
    "count": 1341800
  }
}
Validation set of wikir/it16k. Scoreddocs are the provided BM25 run.
Language: it
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/validation")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/validation queries
[query_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.it16k.validation.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/it16k
Language: it
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/validation")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/validation docs
[doc_id]    [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.it16k.validation')  # an AdhocDocumentStore
for doc in dataset.iter_documents():
    print(doc)
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % | 
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% | 
| 1 | There is a link to the article with the query as its title in the first sentence | 43K | 96.4% | 
| 2 | Query is the article title | 1.6K | 3.6% | 
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/validation")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/validation qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.it16k.validation.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/validation")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/validation scoreddocs --format tsv
[query_id]    [doc_id]    [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
import datamaestro  # Assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.it16k.validation.scoreddocs')  # AdhocRun
# A run is a generic object, specialized into final classes
# such as TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
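A run like this is often materialized as a TREC-format run file, one line per retrieved document in the layout `qid Q0 docid rank score tag`. Converting scoreddocs tuples into that layout is a small transformation; a sketch with made-up tuples for a single query (a per-query rank reset would be needed for multiple queries):

```python
# Hypothetical (query_id, doc_id, score) tuples for one query
scoreddocs = [("q1", "d7", 12.3), ("q1", "d2", 11.0)]

# Sort by query id, then by descending score, and assign 1-based ranks
lines = [
    f"{qid} Q0 {did} {rank} {score} bm25"
    for rank, (qid, did, score) in enumerate(
        sorted(scoreddocs, key=lambda t: (t[0], -t[2])), start=1
    )
]
print(lines[0])  # q1 Q0 d7 1 12.3 bm25
```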
Bibtex:
@inproceedings{Frej2020Wikir, title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={LREC}, year={2020}}
@inproceedings{Frej2020MlWikir, title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={CIRCLE}, year={2020}}
{
  "docs": {
    "count": 503012,
    "fields": {
      "doc_id": {
        "max_len": 6,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 1600
  },
  "qrels": {
    "count": 45003,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "2": 1600,
          "1": 43403
        }
      }
    }
  },
  "scoreddocs": {
    "count": 160000
  }
}