ir_datasets: CodeSearchNet
A benchmark for semantic code search. Uses
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("codesearchnet")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, repo, path, func_name, code, language>
You can find more details about the Python API here.
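As a rough illustration of working with the `language` field of these records, the sketch below tallies documents per language. It does not call ir_datasets: `Doc`, `sample_docs`, and `count_by_language` are hypothetical stand-ins, and the record values are invented. Only the field names come from the namedtuple shown above.

```python
from collections import Counter, namedtuple

# Stand-in for the records yielded by docs_iter(). The field names follow the
# namedtuple shown above; every value below is invented for illustration.
Doc = namedtuple("Doc", ["doc_id", "repo", "path", "func_name", "code", "language"])

sample_docs = [
    Doc("https://github.com/example/a#L1", "example/a", "a.py", "f", "def f(): pass", "python"),
    Doc("https://github.com/example/b#L1", "example/b", "b.go", "G", "func G() {}", "go"),
    Doc("https://github.com/example/c#L1", "example/c", "c.py", "h", "def h(): pass", "python"),
]


def count_by_language(docs):
    """Tally documents per value of the `language` field."""
    return Counter(doc.language for doc in docs)


language_counts = count_by_language(sample_docs)
print(language_counts.most_common())
```

The same function should accept any iterable of records that expose a `language` attribute, so passing `dataset.docs_iter()` in place of `sample_docs` is the intended use.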
ir_datasets export codesearchnet docs
[doc_id] [repo] [path] [func_name] [code] [language]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.codesearchnet')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Bibtex:
@article{Husain2019CodeSearchNet, title={CodeSearchNet Challenge: Evaluating the State of Semantic Code Search}, author={Hamel Husain and Ho-Hsiang Wu and Tiferet Gazit and Miltiadis Allamanis and Marc Brockschmidt}, journal={ArXiv}, year={2019} }
{
"docs": {
"count": 2070536,
"fields": {
"doc_id": {
"max_len": 339,
"common_prefix": "https://github.com/"
}
}
}
}
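The metadata above reports that every `doc_id` shares the prefix `https://github.com/`. A minimal sketch of using that fact to shorten identifiers follows. The helper `strip_common_prefix` and the example identifier are hypothetical; only the metadata values are taken from the block above.

```python
import json

# Metadata copied from the block above.
metadata = json.loads("""
{
  "docs": {
    "count": 2070536,
    "fields": {
      "doc_id": {"max_len": 339, "common_prefix": "https://github.com/"}
    }
  }
}
""")

COMMON_PREFIX = metadata["docs"]["fields"]["doc_id"]["common_prefix"]


def strip_common_prefix(doc_id, prefix=COMMON_PREFIX):
    """Drop the shared prefix when present; return other IDs unchanged."""
    return doc_id[len(prefix):] if doc_id.startswith(prefix) else doc_id


# Invented identifier, used only to exercise the helper.
short_id = strip_common_prefix("https://github.com/example-org/example-repo/blob/main/util.py#L10-L20")
print(short_id)
```

Shortened identifiers like this are only safe as display or storage keys if the prefix is added back before looking documents up by `doc_id`.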
Official challenge set, with keyword queries and deep relevance assessments.
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("codesearchnet/challenge")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export codesearchnet/challenge queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.codesearchnet.challenge.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from codesearchnet
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("codesearchnet/challenge")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, repo, path, func_name, code, language>
You can find more details about the Python API here.
ir_datasets export codesearchnet/challenge docs
[doc_id] [repo] [path] [func_name] [code] [language]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.codesearchnet.challenge')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 0 | Irrelevant | 1.3K | 32.8% |
| 1 | Weak Match | 982 | 24.5% |
| 2 | String Match | 863 | 21.5% |
| 3 | Exact Match | 847 | 21.1% |
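The percentages in this table can be reproduced from the exact per-level counts that the page's metadata lists for this subset (1314, 982, 863, 847). The following is a small worked check; the variable names are illustrative.

```python
# Exact judgment counts per relevance level, as listed in the metadata
# for codesearchnet/challenge.
counts_by_value = {0: 1314, 1: 982, 2: 863, 3: 847}

total_judgments = sum(counts_by_value.values())

# Share of each relevance level, rounded to one decimal as in the table.
percentages = {
    level: round(100 * count / total_judgments, 1)
    for level, count in counts_by_value.items()
}

for level, pct in percentages.items():
    print(f"relevance {level}: {counts_by_value[level]} judgments ({pct}%)")
```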
Examples:
import ir_datasets
dataset = ir_datasets.load("codesearchnet/challenge")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, note>
You can find more details about the Python API here.
ir_datasets export codesearchnet/challenge qrels --format tsv
[query_id] [doc_id] [relevance] [note]
...
You can find more details about the CLI here.
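One way to consume the exported qrels in Python is sketched below. It assumes the export is tab-separated with the four columns in the order shown above, and that an empty `note` still produces a trailing empty column; neither detail is confirmed here. The sample rows and the helper `read_qrels_tsv` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Invented rows in the assumed column order: query_id, doc_id, relevance, note.
sample_tsv = (
    "q1\thttps://github.com/example/a#L1\t3\t\n"
    "q1\thttps://github.com/example/b#L1\t0\tunrelated helper\n"
    "q2\thttps://github.com/example/c#L1\t1\t\n"
)


def read_qrels_tsv(handle):
    """Group judgments by query: {query_id: {doc_id: relevance}}."""
    qrels = defaultdict(dict)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:  # skip blank lines
            continue
        query_id, doc_id, relevance, _note = row
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


qrels_by_query = read_qrels_tsv(io.StringIO(sample_tsv))
print(qrels_by_query)
```

For a real export, the `io.StringIO` object would be replaced by an open file handle on the saved TSV output.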
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.codesearchnet.challenge.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@article{Husain2019CodeSearchNet, title={CodeSearchNet Challenge: Evaluating the State of Semantic Code Search}, author={Hamel Husain and Ho-Hsiang Wu and Tiferet Gazit and Miltiadis Allamanis and Marc Brockschmidt}, journal={ArXiv}, year={2019} }
{
"docs": {
"count": 2070536,
"fields": {
"doc_id": {
"max_len": 339,
"common_prefix": "https://github.com/"
}
}
},
"queries": {
"count": 99
},
"qrels": {
"count": 4006,
"fields": {
"relevance": {
"counts_by_value": {
"0": 1314,
"1": 982,
"2": 863,
"3": 847
}
}
}
}
}
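The metadata above can be sanity-checked with a few lines of Python. This sketch confirms that the per-level counts add up to the reported number of qrels and derives the average number of judgments per query. The JSON is abridged from the block above; the variable names are illustrative.

```python
import json

# Abridged copy of the codesearchnet/challenge metadata shown above.
metadata = json.loads("""
{
  "queries": {"count": 99},
  "qrels": {
    "count": 4006,
    "fields": {
      "relevance": {
        "counts_by_value": {"0": 1314, "1": 982, "2": 863, "3": 847}
      }
    }
  }
}
""")

counts = metadata["qrels"]["fields"]["relevance"]["counts_by_value"]
judged_total = sum(counts.values())

# The per-level counts should account for every judgment.
counts_are_consistent = judged_total == metadata["qrels"]["count"]

# Average number of judged documents per query.
judgments_per_query = metadata["qrels"]["count"] / metadata["queries"]["count"]

print(counts_are_consistent, round(judgments_per_query, 1))
```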
Official test set, using queries inferred from docstrings.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("codesearchnet/test")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export codesearchnet/test queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.codesearchnet.test.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from codesearchnet
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("codesearchnet/test")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, repo, path, func_name, code, language>
You can find more details about the Python API here.
ir_datasets export codesearchnet/test docs
[doc_id] [repo] [path] [func_name] [code] [language]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.codesearchnet.test')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | Matches docstring | 101K | 100.0% |
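Because every judgment in this subset has relevance 1, and the metadata reports as many qrels as queries, a rank-based measure such as mean reciprocal rank (MRR) is a natural fit. The sketch below is one possible way to compute it, not a metric prescribed by this page. It assumes a single relevant document per query, and all identifiers and rankings are invented.

```python
def mean_reciprocal_rank(rankings, relevant):
    """Average of 1/rank of the relevant document; 0 when it is not retrieved.

    rankings: {query_id: [doc_id, ...]} ordered from best to worst.
    relevant: {query_id: doc_id} with exactly one relevant document per query.
    """
    total = 0.0
    for query_id, ranked_doc_ids in rankings.items():
        target = relevant[query_id]
        if target in ranked_doc_ids:
            total += 1.0 / (ranked_doc_ids.index(target) + 1)
    return total / len(rankings)


# Invented toy data: q1 is answered at rank 1, q2 at rank 2, q3 is missed.
relevant = {"q1": "d3", "q2": "d7", "q3": "d1"}
rankings = {
    "q1": ["d3", "d2"],
    "q2": ["d5", "d7"],
    "q3": ["d4", "d9"],
}

mrr = mean_reciprocal_rank(rankings, relevant)
print(mrr)
```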
Examples:
import ir_datasets
dataset = ir_datasets.load("codesearchnet/test")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export codesearchnet/test qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.codesearchnet.test.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@article{Husain2019CodeSearchNet, title={CodeSearchNet Challenge: Evaluating the State of Semantic Code Search}, author={Hamel Husain and Ho-Hsiang Wu and Tiferet Gazit and Miltiadis Allamanis and Marc Brockschmidt}, journal={ArXiv}, year={2019} }
{
"docs": {
"count": 2070536,
"fields": {
"doc_id": {
"max_len": 339,
"common_prefix": "https://github.com/"
}
}
},
"queries": {
"count": 100529
},
"qrels": {
"count": 100529,
"fields": {
"relevance": {
"counts_by_value": {
"1": 100529
}
}
}
}
}
Official train set, using queries inferred from docstrings.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("codesearchnet/train")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export codesearchnet/train queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.codesearchnet.train.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from codesearchnet
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("codesearchnet/train")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, repo, path, func_name, code, language>
You can find more details about the Python API here.
ir_datasets export codesearchnet/train docs
[doc_id] [repo] [path] [func_name] [code] [language]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.codesearchnet.train')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | Matches docstring | 1.9M | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("codesearchnet/train")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export codesearchnet/train qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.codesearchnet.train.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@article{Husain2019CodeSearchNet, title={CodeSearchNet Challenge: Evaluating the State of Semantic Code Search}, author={Hamel Husain and Ho-Hsiang Wu and Tiferet Gazit and Miltiadis Allamanis and Marc Brockschmidt}, journal={ArXiv}, year={2019} }
{
"docs": {
"count": 2070536,
"fields": {
"doc_id": {
"max_len": 339,
"common_prefix": "https://github.com/"
}
}
},
"queries": {
"count": 1880853
},
"qrels": {
"count": 1880853,
"fields": {
"relevance": {
"counts_by_value": {
"1": 1880853
}
}
}
}
}
Official validation set, using queries inferred from docstrings.
Language: en
Examples:
import ir_datasets
dataset = ir_datasets.load("codesearchnet/valid")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>
You can find more details about the Python API here.
ir_datasets export codesearchnet/valid queries
[query_id] [text]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.codesearchnet.valid.queries') # AdhocTopics
for topic in topics.iter():
    print(topic) # An AdhocTopic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
Inherits docs from codesearchnet
Language: multiple/other/unknown
Examples:
import ir_datasets
dataset = ir_datasets.load("codesearchnet/valid")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, repo, path, func_name, code, language>
You can find more details about the Python API here.
ir_datasets export codesearchnet/valid docs
[doc_id] [repo] [path] [func_name] [code] [language]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.codesearchnet.valid')
for doc in dataset.iter_documents():
    print(doc) # an AdhocDocumentStore
    break
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
Relevance levels
| Rel. | Definition | Count | % |
|---|---|---|---|
| 1 | Matches docstring | 89K | 100.0% |
Examples:
import ir_datasets
dataset = ir_datasets.load("codesearchnet/valid")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>
You can find more details about the Python API here.
ir_datasets export codesearchnet/valid qrels --format tsv
[query_id] [doc_id] [relevance] [iteration]
...
You can find more details about the CLI here.
No example available for PyTerrier
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.codesearchnet.valid.qrels') # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels) # the assessments for one topic
This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
Bibtex:
@article{Husain2019CodeSearchNet, title={CodeSearchNet Challenge: Evaluating the State of Semantic Code Search}, author={Hamel Husain and Ho-Hsiang Wu and Tiferet Gazit and Miltiadis Allamanis and Marc Brockschmidt}, journal={ArXiv}, year={2019} }
{
"docs": {
"count": 2070536,
"fields": {
"doc_id": {
"max_len": 339,
"common_prefix": "https://github.com/"
}
}
},
"queries": {
"count": 89154
},
"qrels": {
"count": 89154,
"fields": {
"relevance": {
"counts_by_value": {
"1": 89154
}
}
}
}
}
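Putting the metadata of the three docstring-derived subsets side by side, the query counts for train, valid, and test add up to the shared document count of 2070536. The short check below reproduces that arithmetic and the share of each split. It is an observation drawn from the reported numbers, not a statement about how the collection was built.

```python
# Query counts reported in the metadata of each docstring-derived subset.
split_query_counts = {"train": 1880853, "valid": 89154, "test": 100529}

# Document count reported for the shared codesearchnet collection.
docs_count = 2070536

total_queries = sum(split_query_counts.values())

# Percentage of all docstring-derived queries that falls in each split.
split_shares = {
    name: round(100 * count / total_queries, 1)
    for name, count in split_query_counts.items()
}

print(total_queries == docs_count, split_shares)
```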