ir_datasets: WikIR
A suite of IR benchmarks in multiple languages built from Wikipedia.
Bibtex:
@inproceedings{Frej2020Wikir,
  title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={LREC},
  year={2020}
}
@inproceedings{Frej2020MlWikir,
  title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More},
  author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet},
  booktitle={CIRCLE},
  year={2020}
}
A small version of WikIR for English.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en1k docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k')
# Index wikir/en1k
indexer = pt.IterDictIndexer('./indices/wikir_en1k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en1k')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 369721,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
}
}
Test set of wikir/en1k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/test queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/test')
index_ref = pt.IndexRef.of('./indices/wikir_en1k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.en1k.test.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/en1k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/test docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/test')
# Index wikir/en1k
indexer = pt.IterDictIndexer('./indices/wikir_en1k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en1k.test')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% |
| 1 | There is a link to the article with the query as its title in the first sentence | 4.3K | 97.7% |
| 2 | Query is the article title | 100 | 2.3% |
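The percentage column follows directly from the raw counts. As a minimal sketch (the helper name `relevance_shares` is hypothetical and not part of ir_datasets), the shares can be recomputed from the exact per-level counts reported in this split's metadata (4335 and 100):

```python
def relevance_shares(counts_by_value):
    """Return {relevance level: percentage of all qrels}, rounded to one decimal."""
    total = sum(counts_by_value.values())
    return {
        level: round(100.0 * count / total, 1)
        for level, count in counts_by_value.items()
    }

# Exact counts for wikir/en1k/test, taken from the metadata on this page.
shares = relevance_shares({"1": 4335, "2": 100})
print(shares)  # {'1': 97.7, '2': 2.3}
```

The same function applies unchanged to the counts of the other splits.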
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/test qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/test')
index_ref = pt.IndexRef.of('./indices/wikir_en1k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.en1k.test.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
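For evaluation code it is often convenient to turn the flat qrels stream into a nested lookup. The following is a minimal sketch under stated assumptions: `Qrel` is a stand-in namedtuple with the same fields as the records documented above, `build_qrels_lookup` is a hypothetical helper, and the sample IDs are made up. Unjudged pairs are treated as relevance 0, matching the "Otherwise" row of the table.

```python
from collections import namedtuple

# Stand-in for the qrel records (same field names as documented above).
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])

def build_qrels_lookup(qrels):
    """Return {query_id: {doc_id: relevance}} from an iterable of qrel records."""
    lookup = {}
    for qrel in qrels:
        lookup.setdefault(qrel.query_id, {})[qrel.doc_id] = qrel.relevance
    return lookup

# Made-up sample records for illustration only.
sample_qrels = [
    Qrel("q1", "d1", 2, "0"),
    Qrel("q1", "d2", 1, "0"),
    Qrel("q2", "d9", 2, "0"),
]
lookup = build_qrels_lookup(sample_qrels)

# Pairs that were never judged fall back to relevance 0.
unjudged = lookup["q1"].get("d7", 0)
```

With ir_datasets installed, the same helper could presumably be fed `dataset.qrels_iter()` directly, since it only relies on the documented field names.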
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/test scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/test')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro # assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.en1k.test.scoreddocs') # AdhocRun
# A run is a generic object that is specialized into final classes,
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
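The scoreddocs records carry a score but no explicit rank. A minimal sketch of deriving a ranked list per query is shown below; it assumes that a higher score means a better match, `ScoredDoc` is a stand-in namedtuple with the documented fields, and `rank_by_query` and the sample values are hypothetical.

```python
from collections import defaultdict, namedtuple

# Stand-in for the scoreddoc records (same field names as documented above).
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])

def rank_by_query(scoreddocs):
    """Group records by query and order each group by descending score."""
    by_query = defaultdict(list)
    for record in scoreddocs:
        by_query[record.query_id].append(record)
    return {
        query_id: [r.doc_id for r in sorted(group, key=lambda r: r.score, reverse=True)]
        for query_id, group in by_query.items()
    }

# Made-up sample records for illustration only.
sample_run = [
    ScoredDoc("q1", "d3", 7.5),
    ScoredDoc("q1", "d1", 9.2),
    ScoredDoc("q2", "d2", 4.0),
]
ranking = rank_by_query(sample_run)
```

Grouping in memory is reasonable for the small splits; for the largest runs a streaming approach may be preferable.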
{
"docs": {
"count": 369721,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 100
},
"qrels": {
"count": 4435,
"fields": {
"relevance": {
"counts_by_value": {
"2": 100,
"1": 4335
}
}
}
},
"scoreddocs": {
"count": 10000
}
}
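The counts above can be combined into per-query averages. The sketch below parses a trimmed copy of the metadata and divides each count by the number of queries; `per_query` is a hypothetical helper, and the result is only an average, not a guarantee that every query has the same number of records.

```python
import json

# Trimmed copy of the wikir/en1k/test metadata shown above.
metadata = json.loads("""
{
  "queries": {"count": 100},
  "qrels": {"count": 4435},
  "scoreddocs": {"count": 10000}
}
""")

def per_query(metadata, key):
    """Average number of `key` records per query."""
    return metadata[key]["count"] / metadata["queries"]["count"]

scoreddocs_per_query = per_query(metadata, "scoreddocs")  # 100.0
qrels_per_query = per_query(metadata, "qrels")
print(scoreddocs_per_query, qrels_per_query)
```

The BM25 run therefore averages 100 scored documents per query in this split, and each query has about 44 judged documents.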
Training set of wikir/en1k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/training")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/training queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/training')
index_ref = pt.IndexRef.of('./indices/wikir_en1k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.en1k.training.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/en1k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/training")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/training docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/training')
# Index wikir/en1k
indexer = pt.IterDictIndexer('./indices/wikir_en1k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en1k.training')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% |
| 1 | There is a link to the article with the query as its title in the first sentence | 46K | 97.0% |
| 2 | Query is the article title | 1.4K | 3.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/training")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/training qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
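The TSV export can be read back with the standard library. This is a minimal sketch that assumes the four tab-separated columns appear in the order shown above and that relevance is an integer; `parse_qrels_tsv` is a hypothetical helper and the sample lines are made up.

```python
import csv
import io

def parse_qrels_tsv(stream):
    """Yield (query_id, doc_id, relevance, iteration) tuples from TSV lines."""
    for row in csv.reader(stream, delimiter="\t"):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        query_id, doc_id, relevance, iteration = row
        yield query_id, doc_id, int(relevance), iteration

# Made-up sample lines in the documented column order.
sample = io.StringIO("q1\td1\t2\t0\nq1\td2\t1\t0\n")
rows = list(parse_qrels_tsv(sample))
```

In practice the stream would be a file produced by redirecting the export command's output.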
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/training')
index_ref = pt.IndexRef.of('./indices/wikir_en1k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.en1k.training.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/training")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/training scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/training')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro # assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.en1k.training.scoreddocs') # AdhocRun
# A run is a generic object that is specialized into final classes,
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
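One possible use of the training qrels together with the BM25 run is to build (query, positive, negative) triples for training a re-ranker. This page does not prescribe that usage, so the following is only a sketch under assumptions: `make_triples` is a hypothetical helper, negatives are sampled from documents in the run that have no positive judgment, and the inputs are plain dictionaries with made-up IDs.

```python
import random

def make_triples(ranking, qrels, seed=0):
    """Yield (query_id, positive_doc_id, negative_doc_id) triples.

    ranking: {query_id: [doc_id, ...]} taken from the BM25 run
    qrels:   {query_id: {doc_id: relevance}}
    """
    rng = random.Random(seed)  # seeded for reproducible sampling
    for query_id, doc_ids in ranking.items():
        judged = qrels.get(query_id, {})
        positives = [d for d in doc_ids if judged.get(d, 0) > 0]
        negatives = [d for d in doc_ids if judged.get(d, 0) == 0]
        if not negatives:
            continue  # nothing to contrast the positives with
        for positive in positives:
            yield query_id, positive, rng.choice(negatives)

# Made-up inputs for illustration only.
sample_ranking = {"q1": ["d1", "d2", "d3"]}
sample_qrels = {"q1": {"d1": 2, "d3": 1}}
triples = list(make_triples(sample_ranking, sample_qrels))
```

Relevant documents that BM25 did not retrieve are ignored by this sketch; whether that is acceptable depends on the training setup.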
{
"docs": {
"count": 369721,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 1444
},
"qrels": {
"count": 47699,
"fields": {
"relevance": {
"counts_by_value": {
"2": 1444,
"1": 46255
}
}
}
},
"scoreddocs": {
"count": 144400
}
}
Validation set of wikir/en1k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/validation")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/validation queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_en1k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.en1k.validation.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/en1k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/validation")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/validation docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/validation')
# Index wikir/en1k
indexer = pt.IterDictIndexer('./indices/wikir_en1k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en1k.validation')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% |
| 1 | There is a link to the article with the query as its title in the first sentence | 4.9K | 98.0% |
| 2 | Query is the article title | 100 | 2.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/validation")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/validation qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_en1k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.en1k.validation.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en1k/validation")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/en1k/validation scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en1k/validation')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro # assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.en1k.validation.scoreddocs') # AdhocRun
# A run is a generic object that is specialized into final classes,
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
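Because the relevance labels are graded (1 and 2), a graded measure such as the nDCG@20 used in the PyTerrier examples is a natural fit for scoring the provided BM25 run. The following is a sketch of one common nDCG formulation (linear gain, log2 discount); it is not necessarily the exact variant computed by PyTerrier, and `dcg`, `ndcg_at_k` and the sample data are hypothetical.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))

def ndcg_at_k(ranked_doc_ids, judged, k):
    """nDCG@k for one query; `judged` maps doc_id to graded relevance."""
    gains = [judged.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal_gains = sorted(judged.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Made-up judgments and rankings for illustration only.
judged = {"d1": 2, "d2": 1}
perfect = ndcg_at_k(["d1", "d2", "d3"], judged, k=20)
shuffled = ndcg_at_k(["d3", "d1", "d2"], judged, k=20)
```

Averaging `ndcg_at_k` over all validation queries would give a single figure for the run.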
{
"docs": {
"count": 369721,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 100
},
"qrels": {
"count": 4979,
"fields": {
"relevance": {
"counts_by_value": {
"2": 100,
"1": 4879
}
}
}
},
"scoreddocs": {
"count": 10000
}
}
WikIR for English.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en59k docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k')
# Index wikir/en59k
indexer = pt.IterDictIndexer('./indices/wikir_en59k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en59k')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 2454785,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
}
}
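With roughly 2.45 million documents, it can help to process the corpus in fixed-size batches rather than one document at a time. A minimal sketch follows; the helper `batched` and the batch size of 10,000 are illustrative assumptions, not part of ir_datasets.

```python
import math
from itertools import islice

def batched(iterable, batch_size):
    """Yield lists of up to `batch_size` items until the iterable is exhausted."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Small demonstration on a plain range.
demo = list(batched(range(5), 2))

# Number of batches needed for the wikir/en59k corpus at 10,000 docs per batch.
n_batches = math.ceil(2_454_785 / 10_000)
```

With ir_datasets, the iterable would be `dataset.docs_iter()` from the Python example above.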
Test set of wikir/en59k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/test queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/test')
index_ref = pt.IndexRef.of('./indices/wikir_en59k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.en59k.test.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/en59k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/test docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/test')
# Index wikir/en59k
indexer = pt.IterDictIndexer('./indices/wikir_en59k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en59k.test')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% |
| 1 | There is a link to the article with the query as its title in the first sentence | 104K | 99.0% |
| 2 | Query is the article title | 1.0K | 1.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/test qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/test')
index_ref = pt.IndexRef.of('./indices/wikir_en59k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.en59k.test.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/test scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/test')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro # assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.en59k.test.scoreddocs') # AdhocRun
# A run is a generic object that is specialized into final classes,
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
{
"docs": {
"count": 2454785,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 1000
},
"qrels": {
"count": 104715,
"fields": {
"relevance": {
"counts_by_value": {
"2": 1000,
"1": 103715
}
}
}
},
"scoreddocs": {
"count": 100000
}
}
Training set of wikir/en59k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/training")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/training queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/training')
index_ref = pt.IndexRef.of('./indices/wikir_en59k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.en59k.training.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/en59k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/training")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/training docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/training')
# Index wikir/en59k
indexer = pt.IterDictIndexer('./indices/wikir_en59k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en59k.training')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% |
| 1 | There is a link to the article with the query as its title in the first sentence | 2.4M | 97.7% |
| 2 | Query is the article title | 57K | 2.3% |
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/training")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/training qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/training')
index_ref = pt.IndexRef.of('./indices/wikir_en59k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.en59k.training.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/training")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/training scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/training')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro # assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.en59k.training.scoreddocs') # AdhocRun
# A run is a generic object that is specialized into final classes,
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
{
"docs": {
"count": 2454785,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 57251
},
"qrels": {
"count": 2443383,
"fields": {
"relevance": {
"counts_by_value": {
"2": 57251,
"1": 2386132
}
}
}
},
"scoreddocs": {
"count": 5725100
}
}
Validation set of wikir/en59k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/validation")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/validation queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_en59k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.en59k.validation.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/en59k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/validation")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/validation docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/validation')
# Index wikir/en59k
indexer = pt.IterDictIndexer('./indices/wikir_en59k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en59k.validation')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% |
| 1 | There is a link to the article with the query as its title in the first sentence | 68K | 98.5% |
| 2 | Query is the article title | 1.0K | 1.5% |
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/validation")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/validation qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_en59k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.en59k.validation.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
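Downstream code often needs the judgments as a nested mapping rather than a flat stream. The sketch below groups records shaped like the qrels namedtuple by query. The `Qrel` type, the `group_qrels` helper, and the sample rows are hypothetical stand-ins so the snippet runs without downloading the dataset; with the real data, `dataset.qrels_iter()` would be passed instead.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in with the same fields as the qrels namedtuple above.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def group_qrels(qrels):
    """Return {query_id: {doc_id: relevance}} from an iterable of qrel records."""
    grouped = defaultdict(dict)
    for qrel in qrels:
        grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


# Made-up sample rows; real IDs and judgments come from dataset.qrels_iter().
sample = [
    Qrel("q1", "d1", 2, "0"),
    Qrel("q1", "d7", 1, "0"),
    Qrel("q2", "d3", 2, "0"),
]
by_query = group_qrels(sample)
print(by_query["q1"])
```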
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en59k/validation")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/en59k/validation scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en59k/validation')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro # Assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.en59k.validation.scoreddocs') # AdhocRun
# A run is a generic object that is specialized into concrete classes,
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
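A common first step with a provided run is to keep only the highest-scoring documents per query, for example as candidates for re-ranking. The following is a minimal sketch under the assumption that records carry the `query_id`, `doc_id`, and `score` fields shown above; the `ScoredDoc` type, the `top_k_per_query` helper, and the sample rows are hypothetical.

```python
import heapq
from collections import defaultdict, namedtuple

# Hypothetical stand-in with the same fields as the scoreddocs namedtuple above.
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])


def top_k_per_query(scoreddocs, k):
    """Return {query_id: [doc_id, ...]} with the k highest-scoring docs per query."""
    by_query = defaultdict(list)
    for sd in scoreddocs:
        by_query[sd.query_id].append((sd.score, sd.doc_id))
    return {
        qid: [doc_id for _, doc_id in heapq.nlargest(k, pairs)]
        for qid, pairs in by_query.items()
    }


# Made-up sample rows; real records come from dataset.scoreddocs_iter().
sample = [
    ScoredDoc("q1", "d1", 12.5),
    ScoredDoc("q1", "d2", 15.0),
    ScoredDoc("q1", "d3", 9.1),
    ScoredDoc("q2", "d4", 3.3),
]
top = top_k_per_query(sample, k=2)
print(top)
```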
Bibtex:
@inproceedings{Frej2020Wikir, title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={LREC}, year={2020} }
@inproceedings{Frej2020MlWikir, title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={CIRCLE}, year={2020} }
{
"docs": {
"count": 2454785,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 1000
},
"qrels": {
"count": 68905,
"fields": {
"relevance": {
"counts_by_value": {
"2": 1000,
"1": 67905
}
}
}
},
"scoreddocs": {
"count": 100000
}
}
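The metadata above can be cross-checked against the relevance table for this subset. The sketch below copies the relevant numbers into a JSON string (the `docs` entry is omitted for brevity), confirms that the per-level counts add up to the qrels total, and derives the rounded shares and the average number of scored documents per query. The variable names are illustrative only.

```python
import json

# Counts copied from the wikir/en59k/validation metadata above.
stats = json.loads("""
{
  "queries": {"count": 1000},
  "qrels": {
    "count": 68905,
    "fields": {"relevance": {"counts_by_value": {"2": 1000, "1": 67905}}}
  },
  "scoreddocs": {"count": 100000}
}
""")

counts = stats["qrels"]["fields"]["relevance"]["counts_by_value"]
total = stats["qrels"]["count"]

# The per-level counts should account for every judgment.
assert sum(counts.values()) == total

# Share of each relevance level, rounded to one decimal as in the table.
shares = {level: round(100 * n / total, 1) for level, n in counts.items()}

# Average size of the provided BM25 run per query.
scored_per_query = stats["scoreddocs"]["count"] / stats["queries"]["count"]
print(shares, scored_per_query)
```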
WikIR for English. This is one of the two versions used in Frej2020Wikir.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en78k docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k')
# Index wikir/en78k
indexer = pt.IterDictIndexer('./indices/wikir_en78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en78k')
for doc in dataset.iter_documents():
    print(doc) # a document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Bibtex:
@inproceedings{Frej2020Wikir, title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={LREC}, year={2020} }
@inproceedings{Frej2020MlWikir, title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={CIRCLE}, year={2020} }
{
"docs": {
"count": 2456637,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
}
}
Test set of wikir/en78k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/test queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/test')
index_ref = pt.IndexRef.of('./indices/wikir_en78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.en78k.test.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/en78k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/test docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/test')
# Index wikir/en78k
indexer = pt.IterDictIndexer('./indices/wikir_en78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en78k.test')
for doc in dataset.iter_documents():
    print(doc) # a document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% |
| 1 | There is a link to the article with the query as its title in the first sentence | 345K | 97.8% |
| 2 | Query is the article title | 7.9K | 2.2% |
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/test qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
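The exported file can be read back with the standard library. The sketch below assumes the output is tab-separated with the four columns listed above and has no header row; both points are assumptions to verify against the actual export. The `read_qrels_tsv` helper and the sample text are hypothetical.

```python
import csv
import io

# Made-up sample standing in for the output of the export command.
sample_tsv = "q1\td1\t2\t0\nq1\td7\t1\t0\n"


def read_qrels_tsv(handle):
    """Parse tab-separated qrels into (query_id, doc_id, relevance, iteration) tuples."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(qid, did, int(rel), iteration) for qid, did, rel, iteration in reader]


rows = read_qrels_tsv(io.StringIO(sample_tsv))
print(rows)
```

For a file on disk, the same function can be given `open(path, newline="")` in place of the `io.StringIO` object.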
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/test')
index_ref = pt.IndexRef.of('./indices/wikir_en78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.en78k.test.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
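Because WikIR judgments are graded (levels 1 and 2), a graded measure such as the nDCG@20 used in the experiment above rewards ranking the level-2 article first. The following is a minimal sketch of nDCG@k, assuming the common linear-gain formulation with a log2 rank discount; the evaluation backend used by PyTerrier may differ in details. The function names and sample data are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_doc_ids, judgments, k):
    """nDCG@k for one query; judgments maps doc_id -> relevance level."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    # The ideal ranking lists all judged documents by decreasing relevance.
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Made-up judgments for one query: d1 is the title article, d7 links to it.
judgments = {"d1": 2, "d7": 1}
score = ndcg_at_k(["d7", "d1", "d9"], judgments, k=20)
perfect = ndcg_at_k(["d1", "d7", "d9"], judgments, k=20)
print(round(score, 4), perfect)
```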
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/test scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/test')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro # Assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.en78k.test.scoreddocs') # AdhocRun
# A run is a generic object that is specialized into concrete classes,
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
Bibtex:
@inproceedings{Frej2020Wikir, title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={LREC}, year={2020} }
@inproceedings{Frej2020MlWikir, title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={CIRCLE}, year={2020} }
{
"docs": {
"count": 2456637,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 7862
},
"qrels": {
"count": 353060,
"fields": {
"relevance": {
"counts_by_value": {
"2": 7862,
"1": 345198
}
}
}
},
"scoreddocs": {
"count": 785600
}
}
Training set of wikir/en78k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/training")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/training queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/training')
index_ref = pt.IndexRef.of('./indices/wikir_en78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.en78k.training.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/en78k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/training")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/training docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/training')
# Index wikir/en78k
indexer = pt.IterDictIndexer('./indices/wikir_en78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en78k.training')
for doc in dataset.iter_documents():
    print(doc) # a document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% |
| 1 | There is a link to the article with the query as its title in the first sentence | 2.4M | 97.4% |
| 2 | Query is the article title | 63K | 2.6% |
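The Count column is abbreviated, while the exact figures appear in the metadata for each subset (for this one, 2372353 and 62904). The sketch below shows one rounding rule that reproduces the abbreviations seen in these tables; it is a guess at the convention, not the site's actual formatting code, and `abbreviate` is a hypothetical helper.

```python
def abbreviate(count):
    """Abbreviate a count with K/M suffixes, one decimal below 10 units."""
    for divisor, suffix in ((1_000_000, "M"), (1_000, "K")):
        if count >= divisor:
            scaled = count / divisor
            # Keep one decimal for small values (e.g. 7.9K), none otherwise (e.g. 63K).
            digits = 1 if scaled < 10 else 0
            return f"{scaled:.{digits}f}{suffix}"
    return str(count)


# Exact counts taken from the metadata of the WikIR subsets on this page.
for exact in (2372353, 62904, 7862, 1000):
    print(exact, abbreviate(exact))
```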
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/training")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/training qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/training')
index_ref = pt.IndexRef.of('./indices/wikir_en78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.en78k.training.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/training")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/training scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/training')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro # Assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.en78k.training.scoreddocs') # AdhocRun
# A run is a generic object that is specialized into concrete classes,
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
Bibtex:
@inproceedings{Frej2020Wikir, title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={LREC}, year={2020} }
@inproceedings{Frej2020MlWikir, title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={CIRCLE}, year={2020} }
{
"docs": {
"count": 2456637,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 62904
},
"qrels": {
"count": 2435257,
"fields": {
"relevance": {
"counts_by_value": {
"2": 62904,
"1": 2372353
}
}
}
},
"scoreddocs": {
"count": 6284800
}
}
Validation set of wikir/en78k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/validation")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/validation queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_en78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.en78k.validation.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/en78k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/validation")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/validation docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/validation')
# Index wikir/en78k
indexer = pt.IterDictIndexer('./indices/wikir_en78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.en78k.validation')
for doc in dataset.iter_documents():
    print(doc) # a document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% |
| 1 | There is a link to the article with the query as its title in the first sentence | 264K | 97.1% |
| 2 | Query is the article title | 7.9K | 2.9% |
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/validation")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/validation qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_en78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.en78k.validation.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/en78k/validation")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/en78k/validation scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/en78k/validation')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro # Assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.en78k.validation.scoreddocs') # AdhocRun
# A run is a generic object that is specialized into concrete classes,
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
Bibtex:
@inproceedings{Frej2020Wikir, title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={LREC}, year={2020} }
@inproceedings{Frej2020MlWikir, title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={CIRCLE}, year={2020} }
{
"docs": {
"count": 2456637,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 7862
},
"qrels": {
"count": 271874,
"fields": {
"relevance": {
"counts_by_value": {
"2": 7862,
"1": 264012
}
}
}
},
"scoreddocs": {
"count": 785700
}
}
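With the metadata for all three wikir/en78k subsets listed, the split sizes can be compared directly. The sketch below uses the query counts reported above for training, validation, and test; the total of roughly 78 thousand queries may be what the "78k" in the dataset name refers to, though that reading is an inference rather than something stated here.

```python
# Query counts copied from the metadata of the three wikir/en78k subsets.
query_counts = {"training": 62904, "validation": 7862, "test": 7862}

total = sum(query_counts.values())

# Share of queries in each split, rounded to whole percentages.
shares = {split: round(100 * n / total) for split, n in query_counts.items()}
print(total, shares)
```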
WikIR for English, using the first sentences of articles as queries. This is one of the two versions used in Frej2020Wikir.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k')
# Index wikir/ens78k
indexer = pt.IterDictIndexer('./indices/wikir_ens78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.ens78k')
for doc in dataset.iter_documents():
    print(doc) # a document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Bibtex:
@inproceedings{Frej2020Wikir, title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={LREC}, year={2020} }
@inproceedings{Frej2020MlWikir, title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={CIRCLE}, year={2020} }
{
"docs": {
"count": 2456637,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
}
}
Test set of wikir/ens78k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/test queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/test')
index_ref = pt.IndexRef.of('./indices/wikir_ens78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.ens78k.test.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/ens78k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/test docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/test')
# Index wikir/ens78k
indexer = pt.IterDictIndexer('./indices/wikir_ens78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.ens78k.test')
for doc in dataset.iter_documents():
    print(doc) # a document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% |
| 1 | There is a link to the article with the query as its title in the first sentence | 345K | 97.8% |
| 2 | Query is the article title | 7.9K | 2.2% |
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/test qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/test')
index_ref = pt.IndexRef.of('./indices/wikir_ens78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.ens78k.test.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/test scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/test')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro # Assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.ens78k.test.scoreddocs') # AdhocRun
# A run is a generic object that is specialized into concrete classes,
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
Bibtex:
@inproceedings{Frej2020Wikir, title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={LREC}, year={2020} }
@inproceedings{Frej2020MlWikir, title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={CIRCLE}, year={2020} }
{
"docs": {
"count": 2456637,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 7862
},
"qrels": {
"count": 353060,
"fields": {
"relevance": {
"counts_by_value": {
"2": 7862,
"1": 345198
}
}
}
},
"scoreddocs": {
"count": 786100
}
}
Training set of wikir/ens78k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/training")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/training queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/training')
index_ref = pt.IndexRef.of('./indices/wikir_ens78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.ens78k.training.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/ens78k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/training")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/training docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/training')
# Index wikir/ens78k
indexer = pt.IterDictIndexer('./indices/wikir_ens78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.ens78k.training')
for doc in dataset.iter_documents():
    print(doc) # a document from the AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% |
| 1 | There is a link to the article with the query as its title in the first sentence | 2.4M | 97.4% |
| 2 | Query is the article title | 63K | 2.6% |
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/training")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/training qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/training')
index_ref = pt.IndexRef.of('./indices/wikir_ens78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics(),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.ens78k.training.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/training")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/training scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/training')
dataset.get_results()
You can find more details about PyTerrier dataset API here.
import datamaestro # Assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.ens78k.training.scoreddocs') # AdhocRun
# A run is a generic object that is specialized into concrete classes,
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
Bibtex:
@inproceedings{Frej2020Wikir, title={WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={LREC}, year={2020} }
@inproceedings{Frej2020MlWikir, title={MLWIKIR: A Python Toolkit for Building Large-scale Wikipedia-based Information Retrieval Datasets in Chinese, English, French, Italian, Japanese, Spanish and More}, author={Jibril Frej and Didier Schwab and Jean-Pierre Chevallet}, booktitle={CIRCLE}, year={2020} }
{
"docs": {
"count": 2456637,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 62904
},
"qrels": {
"count": 2435257,
"fields": {
"relevance": {
"counts_by_value": {
"2": 62904,
"1": 2372353
}
}
}
},
"scoreddocs": {
"count": 6289800
}
}
Validation set of wikir/ens78k. Scoreddocs are the provided BM25 run.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/validation")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/validation queries
[query_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_ens78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics())
You can find more details about PyTerrier retrieval here.
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.ens78k.validation.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/ens78k
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/validation")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/validation docs
[doc_id] [text]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/validation')
# Index wikir/ens78k
indexer = pt.IterDictIndexer('./indices/wikir_ens78k')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text'])
You can find more details about PyTerrier indexing here.
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.ens78k.validation')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% |
| 1 | There is a link to the article with the query as its title in the first sentence | 264K | 97.1% |
| 2 | Query is the article title | 7.9K | 2.9% |
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/validation")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/validation qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/validation')
index_ref = pt.IndexRef.of('./indices/wikir_ens78k') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
[pipeline],
dataset.get_topics(),
dataset.get_qrels(),
[MAP, nDCG@20]
)
You can find more details about PyTerrier experiments here.
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.ens78k.validation.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the relevance assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
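The relevance-level table for this split can be recomputed from the qrels themselves. The following is a minimal sketch, assuming only that each record exposes a `relevance` attribute as in the namedtuple shown above. The `Qrel` stand-in, the `relevance_distribution` helper, and the sample rows are hypothetical and exist only so the snippet runs without downloading the dataset.

```python
from collections import Counter, namedtuple

# Stand-in for the records yielded by dataset.qrels_iter(); the field names
# mirror namedtuple<query_id, doc_id, relevance, iteration>.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def relevance_distribution(qrels):
    """Return {relevance: (count, percentage)} for an iterable of qrels."""
    counts = Counter(qrel.relevance for qrel in qrels)
    total = sum(counts.values())
    return {
        level: (count, round(100.0 * count / total, 1))
        for level, count in sorted(counts.items())
    }


# Hypothetical sample: the article titled like the query gets relevance 2,
# three articles linking to it in their first sentence get relevance 1.
sample = [
    Qrel("q1", "d1", 2, "0"),
    Qrel("q1", "d2", 1, "0"),
    Qrel("q1", "d3", 1, "0"),
    Qrel("q1", "d4", 1, "0"),
]
distribution = relevance_distribution(sample)
print(distribution)
```

With ir_datasets installed, the same helper could be fed `ir_datasets.load("wikir/ens78k/validation").qrels_iter()` instead of the sample list.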
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/ens78k/validation")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/ens78k/validation scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:wikir/ens78k/validation')
dataset.get_results()
You can find more details about the PyTerrier dataset API here.
import datamaestro # assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.ens78k.validation.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
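Scoreddocs arrive as a flat stream of `(query_id, doc_id, score)` records, so a common first step is to group them by query and rank them by score. The sketch below assumes nothing beyond those three fields. `ScoredDoc`, `top_k_per_query`, and the sample run are hypothetical names introduced for illustration, not part of ir_datasets.

```python
from collections import defaultdict, namedtuple

# Stand-in for the records yielded by dataset.scoreddocs_iter().
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])


def top_k_per_query(scoreddocs, k=10):
    """Group scored documents by query and keep the k highest-scoring ones."""
    by_query = defaultdict(list)
    for sd in scoreddocs:
        by_query[sd.query_id].append(sd)
    return {
        query_id: sorted(docs, key=lambda sd: sd.score, reverse=True)[:k]
        for query_id, docs in by_query.items()
    }


# Hypothetical run with two queries, given in arbitrary order.
sample_run = [
    ScoredDoc("q1", "d1", 12.5),
    ScoredDoc("q1", "d2", 14.0),
    ScoredDoc("q2", "d3", 7.25),
    ScoredDoc("q1", "d4", 9.0),
]
ranking = top_k_per_query(sample_run, k=2)
for query_id, docs in ranking.items():
    print(query_id, [sd.doc_id for sd in docs])
```

Holding every query's list in memory is fine for a sketch. For the larger splits, a streaming approach may be preferable if the records happen to be ordered by query, which the text here does not state.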
{
"docs": {
"count": 2456637,
"fields": {
"doc_id": {
"max_len": 7,
"common_prefix": ""
}
}
},
"queries": {
"count": 7862
},
"qrels": {
"count": 271874,
"fields": {
"relevance": {
"counts_by_value": {
"2": 7862,
"1": 264012
}
}
}
},
"scoreddocs": {
"count": 786100
}
}
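The metadata block for wikir/ens78k/validation can be cross-checked against the relevance-level table of the same split. The sketch below copies the relevant counts into a JSON string (an abridged, hand-copied excerpt, not a file shipped with the library), checks that the per-level counts add up to the qrels total, and derives the percentages.

```python
import json

# Abridged copy of the wikir/ens78k/validation metadata shown above.
metadata_json = """
{
  "queries": {"count": 7862},
  "qrels": {
    "count": 271874,
    "fields": {"relevance": {"counts_by_value": {"2": 7862, "1": 264012}}}
  },
  "scoreddocs": {"count": 786100}
}
"""

metadata = json.loads(metadata_json)
qrels_meta = metadata["qrels"]
counts = {
    int(level): n
    for level, n in qrels_meta["fields"]["relevance"]["counts_by_value"].items()
}

# The per-level counts should add up to the total number of qrels.
counts_are_consistent = sum(counts.values()) == qrels_meta["count"]

percentages = {
    level: round(100.0 * n / qrels_meta["count"], 1)
    for level, n in sorted(counts.items())
}
print(counts_are_consistent, percentages)
```

The number of relevance-2 judgments equals the number of queries, which is consistent with the definition "Query is the article title" yielding one such article per query.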
WikIR for Spanish.
Language: es
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/es13k docs
[doc_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.es13k')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 645901,
"fields": {
"doc_id": {
"max_len": 6,
"common_prefix": ""
}
}
}
}
Test set of wikir/es13k. Scoreddocs are the provided BM25 run.
Language: es
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/test queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.es13k.test.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/es13k
Language: es
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/test docs
[doc_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.es13k.test')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% |
| 1 | There is a link to the article with the query as its title in the first sentence | 70K | 98.2% |
| 2 | Query is the article title | 1.3K | 1.8% |
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/test qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.es13k.test.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the relevance assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/test scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
import datamaestro # assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.es13k.test.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
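Since no PyTerrier example is available for this split, the provided BM25 run can still be compared against the qrels in plain Python. The sketch below computes a simple mean recall: the average, over judged queries, of the share of relevant documents that appear in the run. It is one possible measure chosen for illustration, not the evaluation protocol of the WikIR authors. `Qrel`, `ScoredDoc`, `mean_recall`, and the sample data are hypothetical stand-ins.

```python
from collections import defaultdict, namedtuple

# Stand-ins for the records yielded by qrels_iter() and scoreddocs_iter().
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])


def mean_recall(qrels, scoreddocs, min_relevance=1):
    """Average share of relevant documents present in the run, per judged query."""
    relevant = defaultdict(set)
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            relevant[qrel.query_id].add(qrel.doc_id)

    retrieved = defaultdict(set)
    for sd in scoreddocs:
        retrieved[sd.query_id].add(sd.doc_id)

    recalls = [
        len(docs & retrieved[query_id]) / len(docs)
        for query_id, docs in relevant.items()
    ]
    return sum(recalls) / len(recalls) if recalls else 0.0


# Hypothetical data: q1 has two relevant documents, one of them retrieved;
# q2 has one relevant document, which is retrieved.
sample_qrels = [
    Qrel("q1", "d1", 2, "0"),
    Qrel("q1", "d2", 1, "0"),
    Qrel("q2", "d5", 2, "0"),
]
sample_run = [
    ScoredDoc("q1", "d1", 11.0),
    ScoredDoc("q1", "d3", 8.0),
    ScoredDoc("q2", "d5", 6.5),
]
recall = mean_recall(sample_qrels, sample_run)
print(recall)
```

With ir_datasets installed, `dataset.qrels_iter()` and `dataset.scoreddocs_iter()` for `wikir/es13k/test` could be passed in place of the sample lists.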
{
"docs": {
"count": 645901,
"fields": {
"doc_id": {
"max_len": 6,
"common_prefix": ""
}
}
},
"queries": {
"count": 1300
},
"qrels": {
"count": 71339,
"fields": {
"relevance": {
"counts_by_value": {
"2": 1300,
"1": 70039
}
}
}
},
"scoreddocs": {
"count": 130000
}
}
Training set of wikir/es13k. Scoreddocs are the provided BM25 run.
Language: es
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/training")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/training queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.es13k.training.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/es13k
Language: es
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/training")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/training docs
[doc_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.es13k.training')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% |
| 1 | There is a link to the article with the query as its title in the first sentence | 466K | 97.7% |
| 2 | Query is the article title | 11K | 2.3% |
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/training")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/training qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.es13k.training.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the relevance assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/training")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/training scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
import datamaestro # assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.es13k.training.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
{
"docs": {
"count": 645901,
"fields": {
"doc_id": {
"max_len": 6,
"common_prefix": ""
}
}
},
"queries": {
"count": 11202
},
"qrels": {
"count": 477212,
"fields": {
"relevance": {
"counts_by_value": {
"2": 11202,
"1": 466010
}
}
}
},
"scoreddocs": {
"count": 1120200
}
}
Validation set of wikir/es13k. Scoreddocs are the provided BM25 run.
Language: es
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/validation")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/validation queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.es13k.validation.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/es13k
Language: es
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/validation")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/validation docs
[doc_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.es13k.validation')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% |
| 1 | There is a link to the article with the query as its title in the first sentence | 57K | 97.8% |
| 2 | Query is the article title | 1.3K | 2.2% |
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/validation")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/validation qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.es13k.validation.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the relevance assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/es13k/validation")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/es13k/validation scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
import datamaestro # assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.es13k.validation.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
{
"docs": {
"count": 645901,
"fields": {
"doc_id": {
"max_len": 6,
"common_prefix": ""
}
}
},
"queries": {
"count": 1300
},
"qrels": {
"count": 58757,
"fields": {
"relevance": {
"counts_by_value": {
"2": 1300,
"1": 57457
}
}
}
},
"scoreddocs": {
"count": 130000
}
}
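The metadata of the three wikir/es13k splits can be put side by side to see how queries, qrels, and scoreddocs relate. The sketch below uses counts copied by hand from the metadata blocks of the training, validation, and test splits. The `splits` and `summary` names are illustrative only.

```python
# Counts copied from the metadata of the three wikir/es13k splits.
splits = {
    "training": {"queries": 11202, "qrels": 477212, "scoreddocs": 1120200},
    "validation": {"queries": 1300, "qrels": 58757, "scoreddocs": 130000},
    "test": {"queries": 1300, "qrels": 71339, "scoreddocs": 130000},
}

total_queries = sum(s["queries"] for s in splits.values())

summary = {
    name: {
        "qrels_per_query": round(s["qrels"] / s["queries"], 2),
        "scoreddocs_per_query": s["scoreddocs"] / s["queries"],
    }
    for name, s in splits.items()
}

print("total queries:", total_queries)
for name, stats in summary.items():
    print(name, stats)
```

For these splits the BM25 run works out to 100 scored documents per query, and the query total across splits is in the range that the "13k" suffix of the dataset name suggests, although the documentation here does not state what the suffix means.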
WikIR for French.
Language: fr
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k docs
[doc_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.fr14k')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 736616,
"fields": {
"doc_id": {
"max_len": 6,
"common_prefix": ""
}
}
}
}
Test set of wikir/fr14k. Scoreddocs are the provided BM25 run.
Language: fr
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/test queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.fr14k.test.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/fr14k
Language: fr
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/test docs
[doc_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.fr14k.test')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% |
| 1 | There is a link to the article with the query as its title in the first sentence | 54K | 97.5% |
| 2 | Query is the article title | 1.4K | 2.5% |
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/test qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.fr14k.test.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the relevance assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/test scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
import datamaestro # assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.fr14k.test.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
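Output of the `ir_datasets export wikir/fr14k/test scoreddocs --format tsv` command shown earlier in this section can be read back with the standard library. The sketch below assumes that each exported line holds exactly the three tab-separated columns `query_id`, `doc_id`, and `score`, with no header row. That layout is inferred from the column listing and should be checked against a real export. `ScoredDoc`, `read_scoreddocs_tsv`, and the sample content are hypothetical.

```python
import csv
import io
from collections import namedtuple

ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])


def read_scoreddocs_tsv(lines):
    """Parse tab-separated query_id, doc_id, score rows into ScoredDoc records."""
    reader = csv.reader(lines, delimiter="\t")
    # Identifiers stay strings; only the score is converted to a float.
    return [ScoredDoc(row[0], row[1], float(row[2])) for row in reader if row]


# Hypothetical export content; a real file would come from the CLI export.
exported = io.StringIO("101\t2001\t17.5\n101\t2002\t16.25\n102\t2003\t9.0\n")
records = read_scoreddocs_tsv(exported)
for record in records:
    print(record)
```

Any iterable of lines works as input, so an open file handle can be passed in the same way as the `io.StringIO` object used here.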
{
"docs": {
"count": 736616,
"fields": {
"doc_id": {
"max_len": 6,
"common_prefix": ""
}
}
},
"queries": {
"count": 1400
},
"qrels": {
"count": 55647,
"fields": {
"relevance": {
"counts_by_value": {
"2": 1400,
"1": 54247
}
}
}
},
"scoreddocs": {
"count": 140000
}
}
Training set of wikir/fr14k. Scoreddocs are the provided BM25 run.
Language: fr
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/training")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/training queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.fr14k.training.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/fr14k
Language: fr
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/training")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/training docs
[doc_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.fr14k.training')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% |
| 1 | There is a link to the article with the query as its title in the first sentence | 598K | 98.1% |
| 2 | Query is the article title | 11K | 1.9% |
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/training")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/training qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.fr14k.training.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the relevance assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/training")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/training scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
import datamaestro # assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.fr14k.training.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
{
"docs": {
"count": 736616,
"fields": {
"doc_id": {
"max_len": 6,
"common_prefix": ""
}
}
},
"queries": {
"count": 11341
},
"qrels": {
"count": 609240,
"fields": {
"relevance": {
"counts_by_value": {
"2": 11341,
"1": 597899
}
}
}
},
"scoreddocs": {
"count": 1134100
}
}
Validation set of wikir/fr14k. Scoreddocs are the provided BM25 run.
Language: fr
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/validation")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/validation queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.fr14k.validation.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/fr14k
Language: fr
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/validation")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/validation docs
[doc_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.fr14k.validation')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% |
| 1 | There is a link to the article with the query as its title in the first sentence | 80K | 98.3% |
| 2 | Query is the article title | 1.4K | 1.7% |
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/validation")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/validation qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.fr14k.validation.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the relevance assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/fr14k/validation")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/fr14k/validation scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
import datamaestro # assumes experimaestro-ir is installed
run = datamaestro.prepare_dataset('irds.wikir.fr14k.validation.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
{
"docs": {
"count": 736616,
"fields": {
"doc_id": {
"max_len": 6,
"common_prefix": ""
}
}
},
"queries": {
"count": 1400
},
"qrels": {
"count": 81255,
"fields": {
"relevance": {
"counts_by_value": {
"2": 1400,
"1": 79855
}
}
}
},
"scoreddocs": {
"count": 140000
}
}
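The counts in the metadata above can be checked for internal consistency. The following is a minimal sketch, assuming the relevant values are copied by hand into a Python dict; the names `metadata` and `check_metadata` are illustrative and not part of ir_datasets.

```python
# Values copied from the metadata block above (wikir/fr14k/validation).
metadata = {
    "queries": {"count": 1400},
    "qrels": {
        "count": 81255,
        "fields": {"relevance": {"counts_by_value": {"2": 1400, "1": 79855}}},
    },
    "scoreddocs": {"count": 140000},
}


def check_metadata(meta):
    """Return simple consistency figures derived from the counts."""
    by_value = meta["qrels"]["fields"]["relevance"]["counts_by_value"]
    n_queries = meta["queries"]["count"]
    return {
        # The per-level counts should add up to the total number of qrels.
        "qrels_sum_matches": sum(by_value.values()) == meta["qrels"]["count"],
        # Average number of scored documents per query in the run.
        "scoreddocs_per_query": meta["scoreddocs"]["count"] / n_queries,
        # Average number of judged documents per query.
        "qrels_per_query": meta["qrels"]["count"] / n_queries,
    }


summary = check_metadata(metadata)
print(summary)
```

For this subset the per-level counts add up to the qrels total, the run averages 100 scored documents per query, and the number of relevance-2 judgments equals the number of queries.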
WikIR for Italian.
Language: it
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/it16k docs
[doc_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.it16k')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
{
"docs": {
"count": 503012,
"fields": {
"doc_id": {
"max_len": 6,
"common_prefix": ""
}
}
}
}
Test set of wikir/it16k. Scoreddocs are the provided BM25 run.
Language: it
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/test queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.it16k.test.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/it16k
Language: it
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/test docs
[doc_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.it16k.test')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% |
| 1 | There is a link to the article with the query as its title in the first sentence | 48K | 96.8% |
| 2 | Query is the article title | 1.6K | 3.2% |
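The % column follows directly from the per-level qrels counts. The following is a minimal sketch of that arithmetic, using the exact counts for wikir/it16k/test (47738 and 1600) reported in this subset's metadata; the helper name `relevance_percentages` is illustrative.

```python
def relevance_percentages(counts_by_value):
    """Map each relevance level to its share of all qrels, in percent."""
    total = sum(counts_by_value.values())
    return {
        level: round(100 * count / total, 1)
        for level, count in counts_by_value.items()
    }


# Exact counts for wikir/it16k/test; level 0 never occurs in the qrels.
counts = {0: 0, 1: 47738, 2: 1600}
shares = relevance_percentages(counts)
print(shares)
```

The rounded shares agree with the table: 96.8% for level 1 and 3.2% for level 2.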
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/test qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.it16k.test.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/test")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/test scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
import datamaestro # Supposes experimaestro-ir be installed
run = datamaestro.prepare_dataset('irds.wikir.it16k.test.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
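Scored documents arrive as flat (query_id, doc_id, score) records, so a common first step is to group them into one ranked list per query. The following is a minimal sketch of that grouping. The `ScoredDoc` namedtuple and the sample records are hypothetical stand-ins; in practice the records would presumably come from `dataset.scoreddocs_iter()`.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in with the same fields as the scoreddoc records above.
ScoredDoc = namedtuple("ScoredDoc", ["query_id", "doc_id", "score"])


def rankings_by_query(scoreddocs):
    """Group scored docs by query and sort each group by descending score."""
    grouped = defaultdict(list)
    for sd in scoreddocs:
        grouped[sd.query_id].append((sd.doc_id, sd.score))
    return {
        qid: sorted(docs, key=lambda pair: pair[1], reverse=True)
        for qid, docs in grouped.items()
    }


# Made-up records for illustration only (not taken from the dataset).
sample = [
    ScoredDoc("q1", "d3", 7.5),
    ScoredDoc("q1", "d1", 12.0),
    ScoredDoc("q2", "d2", 3.1),
]
rankings = rankings_by_query(sample)
print(rankings["q1"])
```

Because the function accepts any iterable with these three fields, it does not need the whole run to be materialized as a list first, although the grouped result itself is held in memory.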
{
"docs": {
"count": 503012,
"fields": {
"doc_id": {
"max_len": 6,
"common_prefix": ""
}
}
},
"queries": {
"count": 1600
},
"qrels": {
"count": 49338,
"fields": {
"relevance": {
"counts_by_value": {
"2": 1600,
"1": 47738
}
}
}
},
"scoreddocs": {
"count": 160000
}
}
Training set of wikir/it16k. Scoreddocs are the provided BM25 run.
Language: it
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/training")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/training queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.it16k.training.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/it16k
Language: it
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/training")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/training docs
[doc_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.it16k.training')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% |
| 1 | There is a link to the article with the query as its title in the first sentence | 369K | 96.5% |
| 2 | Query is the article title | 13K | 3.5% |
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/training")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/training qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.it16k.training.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/training")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/training scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
import datamaestro # Supposes experimaestro-ir be installed
run = datamaestro.prepare_dataset('irds.wikir.it16k.training.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
{
"docs": {
"count": 503012,
"fields": {
"doc_id": {
"max_len": 6,
"common_prefix": ""
}
}
},
"queries": {
"count": 13418
},
"qrels": {
"count": 381920,
"fields": {
"relevance": {
"counts_by_value": {
"2": 13418,
"1": 368502
}
}
}
},
"scoreddocs": {
"count": 1341800
}
}
Validation set of wikir/it16k. Scoreddocs are the provided BM25 run.
Language: it
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/validation")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/validation queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.wikir.it16k.validation.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from wikir/it16k
Language: it
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/validation")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, text>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/validation docs
[doc_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.wikir.it16k.validation')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Otherwise | 0 | 0.0% |
| 1 | There is a link to the article with the query as its title in the first sentence | 43K | 96.4% |
| 2 | Query is the article title | 1.6K | 3.6% |
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/validation")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/validation qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.wikir.it16k.validation.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Examples:
import ir_datasets
dataset = ir_datasets.load("wikir/it16k/validation")
for scoreddoc in dataset.scoreddocs_iter():
    scoreddoc # namedtuple<query_id, doc_id, score>
You can find more details about the Python API here.
ir_datasets export wikir/it16k/validation scoreddocs --format tsv
[query_id] [doc_id] [score]
...
You can find more details about the CLI here.
No example available for PyTerrier
import datamaestro # Supposes experimaestro-ir be installed
run = datamaestro.prepare_dataset('irds.wikir.it16k.validation.scoreddocs') # AdhocRun
# A run is a generic object, and is specialized into final classes
# e.g. TrecAdhocRun
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocRun.
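Since a validation subset provides both a BM25 run and qrels, the two can be combined to score the run. The following is a minimal sketch of one possible measure, mean reciprocal rank, over plain dictionaries. The function name, the input layout, and the toy data are all assumptions made for illustration; this is not an evaluation routine provided by ir_datasets.

```python
def mean_reciprocal_rank(rankings, qrels, min_relevance=1):
    """Average 1/rank of the first relevant document per query.

    rankings: {query_id: [doc_id, ...]} ordered best first.
    qrels: {query_id: {doc_id: relevance}}.
    Queries with no relevant document retrieved contribute 0.
    """
    total = 0.0
    for qid, ranked_docs in rankings.items():
        judged = qrels.get(qid, {})
        for rank, doc_id in enumerate(ranked_docs, start=1):
            if judged.get(doc_id, 0) >= min_relevance:
                total += 1.0 / rank
                break
    return total / len(rankings) if rankings else 0.0


# Made-up toy data for illustration only (not taken from the dataset).
toy_rankings = {"q1": ["d9", "d1", "d4"], "q2": ["d2", "d7"], "q3": ["d5"]}
toy_qrels = {"q1": {"d1": 2, "d4": 1}, "q2": {"d2": 1}, "q3": {"d8": 2}}
mrr = mean_reciprocal_rank(toy_rankings, toy_qrels)
print(round(mrr, 3))
```

Setting `min_relevance=2` would count only documents whose title is the query, per the relevance levels listed for this subset.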
{
"docs": {
"count": 503012,
"fields": {
"doc_id": {
"max_len": 6,
"common_prefix": ""
}
}
},
"queries": {
"count": 1600
},
"qrels": {
"count": 45003,
"fields": {
"relevance": {
"counts_by_value": {
"2": 1600,
"1": 43403
}
}
}
},
"scoreddocs": {
"count": 160000
}
}
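The query counts of the three wikir/it16k subsets show how the queries are split. The following is a minimal sketch of that arithmetic, with the counts copied by hand from each subset's metadata; the variable names are illustrative.

```python
# Query counts copied from the metadata of the three wikir/it16k subsets.
query_counts = {"training": 13418, "validation": 1600, "test": 1600}

total_queries = sum(query_counts.values())

# Share of all queries that falls into each subset, in percent.
split_shares = {
    name: round(100 * count / total_queries, 1)
    for name, count in query_counts.items()
}
print(total_queries, split_shares)
```

Roughly four fifths of the queries are in the training subset, with the remainder divided evenly between validation and test.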