Github: datasets/neuclir.py

ir_datasets: NeuCLIR Corpus

Index
  1. neuclir
  2. neuclir/1
  3. neuclir/1/fa
  4. neuclir/1/fa/hc4-filtered
  5. neuclir/1/fa/trec-2022
  6. neuclir/1/fa/trec-2023
  7. neuclir/1/multi
  8. neuclir/1/multi/trec-2023
  9. neuclir/1/ru
  10. neuclir/1/ru/hc4-filtered
  11. neuclir/1/ru/trec-2022
  12. neuclir/1/ru/trec-2023
  13. neuclir/1/zh
  14. neuclir/1/zh/hc4-filtered
  15. neuclir/1/zh/trec-2022
  16. neuclir/1/zh/trec-2023

"neuclir"

This is the dataset created for the TREC 2022 NeuCLIR Track. NIST will develop and release the topics by June 2022. Relevance judgments will be available after the evaluation (around November).

The collection is designed to be similar to HC4, and a large portion of the documents from HC4 are ported to this collection. Users can conduct experiments on this collection with the queries and qrels in HC4 for development.

  • Documents: Web pages from Common Crawl in Chinese, Persian, and Russian.
  • Queries: (To be released) English TREC-style title/description queries. The narrative field contains an example passage for each relevance level. Human and machine translations of the titles and descriptions into the target language (i.e., the document language) are provided in the query object.
  • Qrels: (To be released) Documents are judged in three levels of relevance. Please refer to the dataset paper for the full definition of the levels.
  • See also: hc4
  • NeuCLIR Track Website
  • Collection Repository

"neuclir/1"

Version 1 of the NeuCLIR corpus.


"neuclir/1/fa"

The Persian collection contains English queries (to be released) and Persian documents for retrieval. Human- and machine-translated queries will be provided in the query object for running monolingual retrieval, or cross-language retrieval assuming the machine query translation into Persian is available.

docs
2.2M docs

Language: fa

Document type:
ExctractedCCDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str
  5. time: str
  6. cc_file: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/fa")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/fa docs
[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.fa')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.
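When these documents are passed to an indexer, the title and body are often joined into a single text field. The following is a minimal sketch of that step, not something ir_datasets does for you: `Doc` is a hypothetical stand-in with the same fields as the document type listed above, `index_text` is an illustrative helper, and the sample values (including the timestamp format) are made up.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the documented fields of the document type.
Doc = namedtuple("Doc", ["doc_id", "title", "text", "url", "time", "cc_file"])


def index_text(doc):
    """Join title and body into one string, skipping an empty title or body."""
    parts = [doc.title.strip(), doc.text.strip()]
    return "\n".join(p for p in parts if p)


# Illustrative records; real ones would come from dataset.docs_iter().
sample_docs = [
    Doc("d1", "Title one", "Body one.", "http://example.com/1", "2020-01-01", "cc-a"),
    Doc("d2", "", "Body two.", "http://example.com/2", "2020-01-02", "cc-b"),
]

indexed = {doc.doc_id: index_text(doc) for doc in sample_docs}
```

With the real collection, `sample_docs` would be replaced by `dataset.docs_iter()` from the Python API example.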


"neuclir/1/fa/hc4-filtered"

Subset of the Persian collection that intersects with HC4. The 60 queries are the hc4/fa/dev and hc4/fa/test sets combined.
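The combination of the two HC4 query sets can be pictured as chaining them and keeping each query_id once. This is only a sketch under assumptions, not how ir_datasets builds the subset: `Query` is a hypothetical two-field stand-in (the real query type has more fields), `combine_query_sets` is an illustrative helper, and the sample records are made up.

```python
from collections import namedtuple
from itertools import chain

# Hypothetical stand-in; the real query type has more fields.
Query = namedtuple("Query", ["query_id", "title"])


def combine_query_sets(*query_sets):
    """Chain several query iterables, keeping the first record per query_id."""
    seen = set()
    combined = []
    for query in chain(*query_sets):
        if query.query_id not in seen:
            seen.add(query.query_id)
            combined.append(query)
    return combined


# Made-up records standing in for the dev and test query sets.
dev = [Query("1", "dev topic a"), Query("2", "dev topic b")]
test = [Query("3", "test topic c"), Query("2", "duplicate of b")]
combined = combine_query_sets(dev, test)
```

In practice the two inputs would be the `queries_iter()` results of the hc4/fa/dev and hc4/fa/test datasets.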

queries
60 queries

Language: multiple/other/unknown

Query type:
ExctractedCCQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. ht_title: str
  5. ht_description: str
  6. mt_title: str
  7. mt_description: str
  8. narrative_by_relevance: Dict[str,str]
  9. report: str
  10. report_url: str
  11. report_date: str
  12. translation_lang: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/fa/hc4-filtered")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/fa/hc4-filtered queries
[query_id]    [title]    [description]    [ht_title]    [ht_description]    [mt_title]    [mt_description]    [narrative_by_relevance]    [report]    [report_url]    [report_date]    [translation_lang]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.neuclir.1.fa.hc4-filtered.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
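Which query field to use depends on the retrieval setting: the English fields for cross-language retrieval, or a translated field for monolingual retrieval. The sketch below shows one way to select the text. It assumes that the `ht_` prefix marks the human translation and `mt_` the machine translation; `Query`, `PREFIX`, and `query_text` are hypothetical names introduced for illustration, and the sample values are made up.

```python
from collections import namedtuple

# Hypothetical stand-in with a subset of the documented query fields.
Query = namedtuple(
    "Query",
    ["query_id", "title", "description",
     "ht_title", "ht_description", "mt_title", "mt_description"],
)

# Field prefix per query source (assumed naming:
# "ht_" = human translation, "mt_" = machine translation).
PREFIX = {"english": "", "human": "ht_", "machine": "mt_"}


def query_text(query, source="english", use_description=False):
    """Return the title, optionally followed by the description."""
    prefix = PREFIX[source]
    text = getattr(query, prefix + "title")
    if use_description:
        text = text + " " + getattr(query, prefix + "description")
    return text


q = Query("1", "EN title", "EN description",
          "HT title", "HT description", "MT title", "MT description")
english_title = query_text(q)
machine_full = query_text(q, "machine", use_description=True)
```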

docs
392K docs

Language: fa

Document type:
ExctractedCCDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str
  5. time: str
  6. cc_file: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/fa/hc4-filtered")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/fa/hc4-filtered docs
[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.fa.hc4-filtered')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels
3.1K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.    2.6K    82.8%
1    Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.    261    8.5%
3    Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.    269    8.7%
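Counts and percentages like those in the table can be recomputed from the judgments, and the graded levels (0, 1, 3) can be collapsed to binary relevance with a threshold. The following is a minimal sketch: `TrecQrel` here is a local stand-in with the documented fields, `relevance_distribution` and `is_relevant` are hypothetical helpers, the threshold of 1 is an assumption rather than a track rule, and the sample judgments are made up.

```python
from collections import Counter, namedtuple

# Local stand-in mirroring the documented qrel fields.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def relevance_distribution(qrels):
    """Return {level: (count, percentage)} over all judgments."""
    counts = Counter(q.relevance for q in qrels)
    total = sum(counts.values())
    return {
        level: (n, round(100 * n / total, 1))
        for level, n in sorted(counts.items())
    }


def is_relevant(qrel, min_level=1):
    """Binary relevance: True when the grade reaches min_level (assumed threshold)."""
    return qrel.relevance >= min_level


# Made-up judgments; real ones would come from dataset.qrels_iter().
sample = [
    TrecQrel("1", "a", 0, "0"),
    TrecQrel("1", "b", 0, "0"),
    TrecQrel("1", "c", 1, "0"),
    TrecQrel("2", "d", 3, "0"),
]
dist = relevance_distribution(sample)
```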

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/fa/hc4-filtered")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/fa/hc4-filtered qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.
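If the exported rows are saved to a file, they can be read back with the standard library. The sketch below assumes the export is tab-separated with the four columns in the order shown above; `Qrel` and `parse_qrels_tsv` are hypothetical names, and the sample rows are made up.

```python
import csv
import io
from collections import namedtuple

# Local stand-in for one judgment row.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def parse_qrels_tsv(lines):
    """Parse tab-separated rows: query_id, doc_id, relevance, iteration."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [Qrel(qid, did, int(rel), it) for qid, did, rel, it in reader]


# Made-up rows standing in for a saved export file.
exported = io.StringIO("1\tdoc-a\t3\t0\n1\tdoc-b\t0\t0\n")
parsed = parse_qrels_tsv(exported)
```

For a real file, `exported` would be an open file handle, for example `open(path, newline="")`.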

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.neuclir.1.fa.hc4-filtered.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
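Evaluation code usually wants the judgments grouped by query rather than as a flat stream. A minimal sketch of that grouping follows; `Qrel` is a local stand-in with the documented fields, `qrels_by_query` is a hypothetical helper, and the sample judgments are made up.

```python
from collections import defaultdict, namedtuple

# Local stand-in mirroring the documented qrel fields.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def qrels_by_query(qrels):
    """Group judgments into {query_id: {doc_id: relevance}}."""
    grouped = defaultdict(dict)
    for qrel in qrels:
        grouped[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(grouped)


# Made-up judgments; real ones would come from dataset.qrels_iter().
sample = [
    Qrel("1", "doc-a", 3, "0"),
    Qrel("1", "doc-b", 0, "0"),
    Qrel("2", "doc-a", 1, "0"),
]
grouped = qrels_by_query(sample)
```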

Citation

ir_datasets.bib:

\cite{Lawrie2022HC4}

Bibtex:

@article{Lawrie2022HC4,
  author = {Dawn Lawrie and James Mayfield and Douglas W. Oard and Eugene Yang},
  title = {HC4: A New Suite of Test Collections for Ad Hoc CLIR},
  booktitle = {{Advances in Information Retrieval. 44th European Conference on IR Research (ECIR 2022)}},
  year = {2022},
  month = apr,
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  site = {Stavanger, Norway},
  url = {https://arxiv.org/abs/2201.09992}
}

"neuclir/1/fa/trec-2022"

Topics and assessments for the TREC NeuCLIR 2022 (Persian language CLIR).

queries
46 queries

Language: multiple/other/unknown

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/fa/trec-2022")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/fa/trec-2022 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.neuclir.1.fa.trec-2022.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
2.2M docs

Inherits docs from neuclir/1/fa

Language: fa

Document type:
ExctractedCCDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str
  5. time: str
  6. cc_file: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/fa/trec-2022")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/fa/trec-2022 docs
[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.fa.trec-2022')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels
34K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.    33K    95.7%
1    Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.    602    1.8%
3    Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.    870    2.5%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/fa/trec-2022")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/fa/trec-2022 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.neuclir.1.fa.trec-2022.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
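To score runs with external tools such as trec_eval, the judgments are commonly written in the conventional four-column TREC qrels layout: query id, iteration, document id, relevance. The sketch below reorders the documented fields into that layout; `Qrel` is a local stand-in, `to_trec_line` is a hypothetical helper, and the sample judgments are made up.

```python
from collections import namedtuple

# Local stand-in mirroring the documented qrel fields.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def to_trec_line(qrel):
    """Format one judgment as 'query_id iteration doc_id relevance'."""
    return f"{qrel.query_id} {qrel.iteration} {qrel.doc_id} {qrel.relevance}"


# Made-up judgments; real ones would come from dataset.qrels_iter().
sample = [Qrel("1", "doc-a", 3, "0"), Qrel("1", "doc-b", 0, "0")]
trec_text = "\n".join(to_trec_line(q) for q in sample) + "\n"
```

Writing `trec_text` to a file gives one judgment per line. Note that the field order differs from the namedtuple order: the iteration column comes second.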


"neuclir/1/fa/trec-2023"

Topics and assessments for the TREC NeuCLIR 2023 (Persian language CLIR).

queries
76 queries

Language: multiple/other/unknown

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/fa/trec-2023")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/fa/trec-2023 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.neuclir.1.fa.trec-2023.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
2.2M docs

Inherits docs from neuclir/1/fa

Language: fa

Document type:
ExctractedCCDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str
  5. time: str
  6. cc_file: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/fa/trec-2023")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/fa/trec-2023 docs
[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.fa.trec-2023')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels
27K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.    22K    81.1%
1    Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.    2.5K    9.3%
3    Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.    479    1.8%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/fa/trec-2023")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/fa/trec-2023 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.neuclir.1.fa.trec-2023.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.


"neuclir/1/multi"

A combined corpus of NeuCLIR v1 including all Persian, Russian, and Chinese documents.

docs
10M docs

Language: multiple/other/unknown

Document type:
ExctractedCCDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str
  5. time: str
  6. cc_file: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/multi")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/multi docs
[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.multi')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.


"neuclir/1/multi/trec-2023"

Topics and assessments for the TREC NeuCLIR 2023 multi-language retrieval task.

queries
76 queries

Language: multiple/other/unknown

Query type:
ExctractedCCMultiMtQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str
  5. fa_mt_title: str
  6. fa_mt_description: str
  7. fa_mt_narrative: str
  8. ru_mt_title: str
  9. ru_mt_description: str
  10. ru_mt_narrative: str
  11. zh_mt_title: str
  12. zh_mt_description: str
  13. zh_mt_narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/multi/trec-2023")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative, fa_mt_title, fa_mt_description, fa_mt_narrative, ru_mt_title, ru_mt_description, ru_mt_narrative, zh_mt_title, zh_mt_description, zh_mt_narrative>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/multi/trec-2023 queries
[query_id]    [title]    [description]    [narrative]    [fa_mt_title]    [fa_mt_description]    [fa_mt_narrative]    [ru_mt_title]    [ru_mt_description]    [ru_mt_narrative]    [zh_mt_title]    [zh_mt_description]    [zh_mt_narrative]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.neuclir.1.multi.trec-2023.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
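Because the machine-translated fields follow a regular `<lang>_mt_<part>` naming pattern, the per-language query text can be collected with a small helper. This is a sketch under assumptions: `MultiQuery` is a hypothetical stand-in built from the field names listed above, `mt_queries` is an illustrative helper, and the sample values are placeholders.

```python
from collections import namedtuple

LANGS = ("fa", "ru", "zh")
PARTS = ("title", "description", "narrative")

# Hypothetical stand-in: English fields first, then <lang>_mt_<part> per language.
FIELDS = ["query_id", "title", "description", "narrative"] + [
    f"{lang}_mt_{part}" for lang in LANGS for part in PARTS
]
MultiQuery = namedtuple("MultiQuery", FIELDS)


def mt_queries(query, part="title"):
    """Return {language: machine-translated text} for one query part."""
    return {lang: getattr(query, f"{lang}_mt_{part}") for lang in LANGS}


# Placeholder values; real ones would come from dataset.queries_iter().
q = MultiQuery("1", "t", "d", "n",
               "fa-t", "fa-d", "fa-n",
               "ru-t", "ru-d", "ru-n",
               "zh-t", "zh-d", "zh-n")
titles = mt_queries(q)
```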

docs
10M docs

Inherits docs from neuclir/1/multi

Language: multiple/other/unknown

Document type:
ExctractedCCDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str
  5. time: str
  6. cc_file: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/multi/trec-2023")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/multi/trec-2023 docs
[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.multi.trec-2023')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels
80K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.    66K    82.7%
1    Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.    6.0K    7.6%
3    Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.    635    0.8%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/multi/trec-2023")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/multi/trec-2023 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.neuclir.1.multi.trec-2023.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.


"neuclir/1/ru"

The Russian collection contains English queries (to be released) and Russian documents for retrieval. Human- and machine-translated queries will be provided in the query object for running monolingual retrieval, or cross-language retrieval assuming the machine query translation into Russian is available.

docs
4.6M docs

Language: ru

Document type:
ExctractedCCDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str
  5. time: str
  6. cc_file: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/ru")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/ru docs
[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.ru')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.


"neuclir/1/ru/hc4-filtered"

Subset of the Russian collection that intersects with HC4. The 54 queries are the hc4/ru/dev and hc4/ru/test sets combined.

queries
54 queries

Language: multiple/other/unknown

Query type:
ExctractedCCQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. ht_title: str
  5. ht_description: str
  6. mt_title: str
  7. mt_description: str
  8. narrative_by_relevance: Dict[str,str]
  9. report: str
  10. report_url: str
  11. report_date: str
  12. translation_lang: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/ru/hc4-filtered")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/ru/hc4-filtered queries
[query_id]    [title]    [description]    [ht_title]    [ht_description]    [mt_title]    [mt_description]    [narrative_by_relevance]    [report]    [report_url]    [report_date]    [translation_lang]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.neuclir.1.ru.hc4-filtered.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
965K docs

Language: ru

Document type:
ExctractedCCDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str
  5. time: str
  6. cc_file: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/ru/hc4-filtered")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/ru/hc4-filtered docs
[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.ru.hc4-filtered')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels
3.2K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.    2.5K    76.8%
1    Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.    478    14.8%
3    Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.    274    8.5%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/ru/hc4-filtered")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/ru/hc4-filtered qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.neuclir.1.ru.hc4-filtered.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Lawrie2022HC4}

Bibtex:

@article{Lawrie2022HC4,
  author = {Dawn Lawrie and James Mayfield and Douglas W. Oard and Eugene Yang},
  title = {HC4: A New Suite of Test Collections for Ad Hoc CLIR},
  booktitle = {{Advances in Information Retrieval. 44th European Conference on IR Research (ECIR 2022)}},
  year = {2022},
  month = apr,
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  site = {Stavanger, Norway},
  url = {https://arxiv.org/abs/2201.09992}
}

"neuclir/1/ru/trec-2022"

Topics and assessments for the TREC NeuCLIR 2022 (Russian language CLIR).

queries
45 queries

Language: multiple/other/unknown

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/ru/trec-2022")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/ru/trec-2022 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.neuclir.1.ru.trec-2022.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
4.6M docs

Inherits docs from neuclir/1/ru

Language: ru

Document type:
ExctractedCCDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str
  5. time: str
  6. cc_file: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/ru/trec-2022")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/ru/trec-2022 docs
[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.ru.trec-2022')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels
33K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.    31K    94.3%
1    Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.    1.1K    3.3%
3    Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.    810    2.5%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/ru/trec-2022")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/ru/trec-2022 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.neuclir.1.ru.trec-2022.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.


"neuclir/1/ru/trec-2023"

Topics and assessments for the TREC NeuCLIR 2023 (Russian language CLIR).

queries
76 queries

Language: multiple/other/unknown

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/ru/trec-2023")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/ru/trec-2023 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.neuclir.1.ru.trec-2023.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
4.6M docs

Inherits docs from neuclir/1/ru

Language: ru

Document type:
ExctractedCCDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str
  5. time: str
  6. cc_file: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/ru/trec-2023")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/ru/trec-2023 docs
[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.ru.trec-2023')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels
26K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition    Count    %
0    Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.    21K    81.6%
1    Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.    1.4K    5.5%
3    Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.    117    0.5%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/ru/trec-2023")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/ru/trec-2023 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.neuclir.1.ru.trec-2023.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.


"neuclir/1/zh"

The Chinese collection contains English queries (to be released) and Chinese documents for retrieval. Human- and machine-translated queries will be provided in the query object for running monolingual retrieval, or cross-language retrieval assuming the machine query translation into Chinese is available.

docs
3.2M docs

Language: zh

Document type:
ExctractedCCDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str
  5. time: str
  6. cc_file: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/zh")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.
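
With 3.2M documents, it is often enough to inspect a handful of records before committing to a full pass. The sketch below uses `itertools.islice` for that; `peek` is a hypothetical helper, and the generator is a stand-in so the block runs without the corpus.

```python
from itertools import islice

def peek(records, n=3):
    """Return the first n items of an iterable, reading no further than needed."""
    return list(islice(records, n))

# Stand-in generator; with ir_datasets installed this would be dataset.docs_iter().
fake_docs = ({"doc_id": f"doc-{i}", "title": f"Title {i}"} for i in range(1_000_000))
first = peek(fake_docs, n=2)
```

Because `islice` stops after `n` items, the remaining documents are never materialized.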

CLI
ir_datasets export neuclir/1/zh docs
[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.zh')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.


"neuclir/1/zh/hc4-filtered"

Subset of the Chinese collection that intersects with HC4. The 60 queries are the hc4/zh/dev and hc4/zh/test sets combined.

queries
60 queries

Language: multiple/other/unknown

Query type:
ExctractedCCQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. ht_title: str
  5. ht_description: str
  6. mt_title: str
  7. mt_description: str
  8. narrative_by_relevance: Dict[str,str]
  9. report: str
  10. report_url: str
  11. report_date: str
  12. translation_lang: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/zh/hc4-filtered")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.
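
Which query fields to use depends on the retrieval setting. The sketch below assumes that the `ht_` fields hold human translations and the `mt_` fields hold machine translations into the document language, which matches the field names but is not spelled out on this page. `select_query_text`, `PREFIXES`, and the `Query` stand-in are hypothetical names used only for illustration.

```python
from collections import namedtuple

# Stand-in carrying a subset of the ExctractedCCQuery fields listed above.
Query = namedtuple(
    "Query",
    ["query_id", "title", "description", "ht_title", "ht_description",
     "mt_title", "mt_description"],
)

# Assumed field-name prefixes: English originals have none, human
# translations use "ht_", machine translations use "mt_".
PREFIXES = {"english": "", "human": "ht_", "machine": "mt_"}

def select_query_text(query, source="english", fields=("title",)):
    """Join the requested fields from the chosen query variant."""
    prefix = PREFIXES[source]
    return " ".join(getattr(query, prefix + name) for name in fields)

# Hypothetical placeholder values, not real NeuCLIR topics.
q = Query("0", "en title", "en desc", "ht title", "ht desc", "mt title", "mt desc")
english = select_query_text(q)
machine = select_query_text(q, source="machine", fields=("title", "description"))
```

Under that assumption, the English fields would serve cross-language retrieval, while the `ht_` or `mt_` fields would serve monolingual retrieval against the Chinese documents.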

CLI
ir_datasets export neuclir/1/zh/hc4-filtered queries
[query_id]    [title]    [description]    [ht_title]    [ht_description]    [mt_title]    [mt_description]    [narrative_by_relevance]    [report]    [report_url]    [report_date]    [translation_lang]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.neuclir.1.zh.hc4-filtered.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
520K docs

Language: zh

Document type:
ExctractedCCDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str
  5. time: str
  6. cc_file: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/zh/hc4-filtered")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/zh/hc4-filtered docs
[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.zh.hc4-filtered')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels
3.2K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel. | Definition | Count | %
0 | Not-valuable. Information in the document might be included in a report footnote, or omitted entirely. | 2.7K | 82.4%
1 | Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report. | 222 | 6.9%
3 | Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic. | 344 | 10.7%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/zh/hc4-filtered")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/zh/hc4-filtered qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.
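
The exported TSV can be read back with the standard `csv` module. This is a minimal sketch that assumes the column order shown in the sample output above (query_id, doc_id, relevance, iteration); `read_qrels_tsv` and the in-memory sample are hypothetical.

```python
import csv
import io
from collections import namedtuple

Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])

def read_qrels_tsv(handle):
    """Yield one Qrel per non-empty tab-separated line."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        query_id, doc_id, relevance, iteration = row
        yield Qrel(query_id, doc_id, int(relevance), iteration)

# Hypothetical sample standing in for the exported file.
sample = io.StringIO("q1\td1\t3\t0\nq1\td2\t0\t0\n")
rows = list(read_qrels_tsv(sample))
```

For a real export, pass an open file handle instead of the `io.StringIO` sample.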

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.neuclir.1.zh.hc4-filtered.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments (qrels) for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Lawrie2022HC4}

Bibtex:

@article{Lawrie2022HC4,
  author = {Dawn Lawrie and James Mayfield and Douglas W. Oard and Eugene Yang},
  title = {HC4: A New Suite of Test Collections for Ad Hoc CLIR},
  booktitle = {{Advances in Information Retrieval. 44th European Conference on IR Research (ECIR 2022)}},
  year = {2022},
  month = apr,
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  site = {Stavanger, Norway},
  url = {https://arxiv.org/abs/2201.09992}
}

"neuclir/1/zh/trec-2022"

Topics and assessments for the TREC NeuCLIR 2022 (Chinese language CLIR).

queries
49 queries

Language: multiple/other/unknown

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/zh/trec-2022")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/zh/trec-2022 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.neuclir.1.zh.trec-2022.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
3.2M docs

Inherits docs from neuclir/1/zh

Language: zh

Document type:
ExctractedCCDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str
  5. time: str
  6. cc_file: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/zh/trec-2022")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.
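
Since this subset inherits the full 3.2M-document corpus, fetching only the judged documents by ID can be cheaper than iterating over everything. The sketch below assumes a store object with a `get_many(doc_ids)` method that returns a dict keyed by doc_id, which is how the object returned by `dataset.docs_store()` is understood to behave. `judged_documents` and `DictStore` are hypothetical, and `DictStore` exists only so the block runs offline.

```python
from collections import namedtuple

Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])

class DictStore:
    """Offline stand-in for a document store with a get_many method."""
    def __init__(self, docs):
        self._docs = docs

    def get_many(self, doc_ids):
        return {d: self._docs[d] for d in doc_ids if d in self._docs}

def judged_documents(qrels, store, min_relevance=1):
    """Map doc_id -> document for every qrel at or above min_relevance."""
    wanted = sorted({q.doc_id for q in qrels if q.relevance >= min_relevance})
    return store.get_many(wanted)

# Hypothetical sample data, not taken from the real collection.
store = DictStore({"d1": "text one", "d2": "text two", "d3": "text three"})
qrels = [Qrel("q1", "d1", 3, "0"), Qrel("q1", "d2", 0, "0"), Qrel("q2", "d3", 1, "0")]
docs = judged_documents(qrels, store)
```

In a real run, `store` would be `dataset.docs_store()` and `qrels` would be `dataset.qrels_iter()`, under the assumption stated above.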

CLI
ir_datasets export neuclir/1/zh/trec-2022 docs
[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.zh.trec-2022')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels
37K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel. | Definition | Count | %
0 | Not-valuable. Information in the document might be included in a report footnote, or omitted entirely. | 34K | 94.2%
1 | Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report. | 1.4K | 3.9%
3 | Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic. | 720 | 2.0%
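
Many evaluation tools expect graded judgments as a nested mapping from query to document to relevance. The following sketch builds that structure from qrel tuples; `qrels_to_dict` is a hypothetical helper and the sample rows are made up.

```python
from collections import defaultdict, namedtuple

Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])

def qrels_to_dict(qrels):
    """Return {query_id: {doc_id: relevance}} for an iterable of qrels."""
    nested = defaultdict(dict)
    for q in qrels:
        nested[q.query_id][q.doc_id] = q.relevance
    return dict(nested)

# Hypothetical sample data, not taken from the real collection.
sample = [
    Qrel("q1", "d1", 3, "0"),
    Qrel("q1", "d2", 0, "0"),
    Qrel("q2", "d3", 1, "0"),
]
by_query = qrels_to_dict(sample)
```

The graded values (0, 1, 3) are kept as they are, so gain-based measures can still tell the levels apart.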

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/zh/trec-2022")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/zh/trec-2022 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.neuclir.1.zh.trec-2022.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments (qrels) for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.


"neuclir/1/zh/trec-2023"

Topics and assessments for the TREC NeuCLIR 2023 (Chinese language CLIR).

queries
76 queries

Language: multiple/other/unknown

Query type:
GenericQuery: (namedtuple)
  1. query_id: str
  2. text: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/zh/trec-2023")
for query in dataset.queries_iter():
    query # namedtuple<query_id, text>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/zh/trec-2023 queries
[query_id]    [text]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
topics = prepare_dataset('irds.neuclir.1.zh.trec-2023.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs
3.2M docs

Inherits docs from neuclir/1/zh

Language: zh

Document type:
ExctractedCCDoc: (namedtuple)
  1. doc_id: str
  2. title: str
  3. text: str
  4. url: str
  5. time: str
  6. cc_file: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/zh/trec-2023")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/zh/trec-2023 docs
[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.zh.trec-2023')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels
28K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel. | Definition | Count | %
0 | Not-valuable. Information in the document might be included in a report footnote, or omitted entirely. | 24K | 85.2%
1 | Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report. | 2.1K | 7.7%
3 | Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic. | 39 | 0.1%
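
Set-based measures need binary judgments, so the graded levels above have to be thresholded first. The sketch below is one way to do that; `binarize` is a hypothetical helper, and the default threshold of 1 is an illustrative choice rather than a track recommendation.

```python
from collections import defaultdict, namedtuple

Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])

def binarize(qrels, min_relevance=1):
    """Return {query_id: set of doc_ids judged at or above min_relevance}."""
    relevant = defaultdict(set)
    for q in qrels:
        if q.relevance >= min_relevance:
            relevant[q.query_id].add(q.doc_id)
    return dict(relevant)

# Hypothetical sample data, not taken from the real collection.
sample = [
    Qrel("q1", "d1", 3, "0"),
    Qrel("q1", "d2", 1, "0"),
    Qrel("q1", "d3", 0, "0"),
    Qrel("q2", "d4", 0, "0"),
]
lenient = binarize(sample)                  # levels 1 and 3 count as relevant
strict = binarize(sample, min_relevance=3)  # only "very-valuable" counts
```

Queries with no document at or above the threshold (such as `q2` here) are absent from the result, which is worth checking before averaging per-query scores.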

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("neuclir/1/zh/trec-2023")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI
ir_datasets export neuclir/1/zh/trec-2023 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier

No example available for PyTerrier

XPM-IR
from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.neuclir.1.zh.trec-2023.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments (qrels) for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
