ir_datasets : NeuCLIR Corpus

import ir_datasets
dataset = ir_datasets.load("neuclir/1/fa")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/fa docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.fa')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

{
  "docs": {
    "count": 2232016,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  }
}

`"neuclir/1/fa/hc4-filtered"`

Subset of the Persian collection that intersects with HC4. The 60 queries are the hc4/fa/dev and hc4/fa/test sets combined.

60 queries

Language: multiple/other/unknown

Query type:

ExctractedCCQuery: (namedtuple)

query_id: str
title: str
description: str
ht_title: str
ht_description: str
mt_title: str
mt_description: str
narrative_by_relevance: Dict[str,str]
report: str
report_url: str
report_date: str
translation_lang: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/fa/hc4-filtered")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/fa/hc4-filtered queries



[query_id]    [title]    [description]    [ht_title]    [ht_description]    [mt_title]    [mt_description]    [narrative_by_relevance]    [report]    [report_url]    [report_date]    [translation_lang]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.neuclir.1.fa.hc4-filtered.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

392K docs

Language: fa

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str
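The document type above keeps the title and the body in separate fields. A minimal sketch of joining them into one string for indexing, assuming a stand-in namedtuple with the same six fields; the helper name `index_text` and the newline separator are illustrative choices, not part of ir_datasets:

```python
from collections import namedtuple

# Stand-in for the document namedtuple listed above (same field names).
Doc = namedtuple("Doc", ["doc_id", "title", "text", "url", "time", "cc_file"])


def index_text(doc, separator="\n"):
    """Join the non-empty title and text of a document into one string."""
    parts = [part.strip() for part in (doc.title, doc.text) if part and part.strip()]
    return separator.join(parts)


# Hypothetical documents; with ir_datasets installed, iterate dataset.docs_iter() instead.
with_title = Doc("id-1", "Title", "Body", "http://example.org", "", "")
without_title = Doc("id-2", "", "Body", "http://example.org", "", "")

print(index_text(with_title))
print(index_text(without_title))
```

Skipping empty parts avoids a leading separator for documents that have no title.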

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/fa/hc4-filtered")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/fa/hc4-filtered docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.fa.hc4-filtered')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

3.1K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.	`2.6K`	82.8%
1	Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.	`261`	8.5%
3	Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.	`269`	8.7%
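The Count and % columns above can be recomputed from any iterable of qrels. The following is a minimal sketch, assuming a stand-in `TrecQrel` namedtuple with the four fields listed earlier; the helper name `relevance_histogram` and the sample rows are hypothetical:

```python
from collections import Counter, namedtuple

# Stand-in for the TrecQrel namedtuple described above (same four fields).
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def relevance_histogram(qrels):
    """Return {relevance: (count, percent)} for any iterable of qrels."""
    counts = Counter(q.relevance for q in qrels)
    total = sum(counts.values())
    return {
        level: (n, round(100.0 * n / total, 1))
        for level, n in sorted(counts.items())
    }


# Hypothetical sample; with ir_datasets installed, pass dataset.qrels_iter() instead.
sample = [
    TrecQrel("q1", "d1", 0, "0"),
    TrecQrel("q1", "d2", 0, "0"),
    TrecQrel("q1", "d3", 1, "0"),
    TrecQrel("q2", "d4", 3, "0"),
]
histogram = relevance_histogram(sample)
print(histogram)
```

Because the helper only reads the `relevance` attribute, it should work equally on the tuples yielded by `qrels_iter()`.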

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/fa/hc4-filtered")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/fa/hc4-filtered qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...
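The exported rows can be read back with the standard `csv` module. This is a sketch under two assumptions: the columns appear in the order shown above, and the file has no header row. The function name `read_qrels_tsv` and the sample rows are hypothetical:

```python
import csv
import io
from collections import namedtuple

# Stand-in for the TrecQrel namedtuple described above.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def read_qrels_tsv(handle):
    """Parse tab-separated qrels rows into TrecQrel tuples."""
    qrels = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:  # skip blank lines
            continue
        query_id, doc_id, relevance, iteration = row
        qrels.append(TrecQrel(query_id, doc_id, int(relevance), iteration))
    return qrels


# Hypothetical rows standing in for the exported file.
sample = io.StringIO("q1\tdoc-a\t3\t0\nq1\tdoc-b\t0\t0\n")
parsed = read_qrels_tsv(sample)
print(parsed[0])
```

Converting `relevance` to `int` matches the type declared for that field; the other fields stay as strings.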

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.neuclir.1.fa.hc4-filtered.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Lawrie2022HC4}

Bibtex:

@article{Lawrie2022HC4,
  author = {Dawn Lawrie and James Mayfield and Douglas W. Oard and Eugene Yang},
  title = {HC4: A New Suite of Test Collections for Ad Hoc CLIR},
  booktitle = {{Advances in Information Retrieval. 44th European Conference on IR Research (ECIR 2022)}},
  year = {2022},
  month = apr,
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  site = {Stavanger, Norway},
  url = {https://arxiv.org/abs/2201.09992}
}

{
  "docs": {
    "count": 391703,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 60
  },
  "qrels": {
    "count": 3087,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 2557,
          "3": 269,
          "1": 261
        }
      }
    }
  }
}

`"neuclir/1/fa/trec-2022"`

Topics and assessments for the TREC NeuCLIR 2022 (Persian language CLIR).

46 queries

Language: multiple/other/unknown

Query type:

ExctractedCCNoReportQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str
ht_title: str
ht_description: str
ht_narrative: str
mt_title: str
mt_description: str
mt_narrative: str
translation_lang: str
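Each query carries several versions of the same text. The sketch below picks one title variant by name, assuming that the `ht_` prefix marks the human translation and the `mt_` prefix the machine translation; the stand-in `Query` tuple, the helper `title_for`, and the variant labels are hypothetical:

```python
from collections import namedtuple

# Stand-in with a subset of the query fields listed above.
Query = namedtuple("Query", ["query_id", "title", "ht_title", "mt_title"])

# Assumed reading of the prefixes: ht_ = human-translated, mt_ = machine-translated.
VARIANT_FIELDS = {"original": "title", "human": "ht_title", "machine": "mt_title"}


def title_for(query, variant="original"):
    """Return the title of a query for the requested variant."""
    return getattr(query, VARIANT_FIELDS[variant])


# Hypothetical query; real ones come from dataset.queries_iter().
q = Query("1", "original title", "human translation", "machine translation")
print(title_for(q, "machine"))
```

The same lookup extends to the description and narrative fields by adding entries to the mapping.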

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/fa/trec-2022")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative, ht_title, ht_description, ht_narrative, mt_title, mt_description, mt_narrative, translation_lang>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/fa/trec-2022 queries



[query_id]    [title]    [description]    [narrative]    [ht_title]    [ht_description]    [ht_narrative]    [mt_title]    [mt_description]    [mt_narrative]    [translation_lang]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.neuclir.1.fa.trec-2022.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

2.2M docs

Inherits docs from neuclir/1/fa

Language: fa

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/fa/trec-2022")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/fa/trec-2022 docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.fa.trec-2022')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

34K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.	`33K`	95.7%
1	Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.	`602`	1.8%
3	Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.	`870`	2.5%

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/fa/trec-2022")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/fa/trec-2022 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.neuclir.1.fa.trec-2022.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

{
  "docs": {
    "count": 2232016,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 46
  },
  "qrels": {
    "count": 34174,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "3": 870,
          "0": 32702,
          "1": 602
        }
      }
    }
  }
}

`"neuclir/1/fa/trec-2023"`

Topics and assessments for the TREC NeuCLIR 2023 (Persian language CLIR).

76 queries

Language: multiple/other/unknown

Query type:

ExctractedCCNoReportNoHtNarQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str
ht_title: str
ht_description: str
mt_title: str
mt_description: str
mt_narrative: str
translation_lang: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/fa/trec-2023")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative, ht_title, ht_description, mt_title, mt_description, mt_narrative, translation_lang>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/fa/trec-2023 queries



[query_id]    [title]    [description]    [narrative]    [ht_title]    [ht_description]    [mt_title]    [mt_description]    [mt_narrative]    [translation_lang]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.neuclir.1.fa.trec-2023.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

2.2M docs

Inherits docs from neuclir/1/fa

Language: fa

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/fa/trec-2023")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/fa/trec-2023 docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.fa.trec-2023')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

27K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.	`22K`	81.1%
1	Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.	`2.5K`	9.3%
2	(No definition listed.)	`2.1K`	7.8%
3	Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.	`479`	1.8%

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/fa/trec-2023")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/fa/trec-2023 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.neuclir.1.fa.trec-2023.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

{
  "docs": {
    "count": 2232016,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 76
  },
  "qrels": {
    "count": 26662,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 2491,
          "2": 2068,
          "0": 21624,
          "3": 479
        }
      }
    }
  }
}
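The `counts_by_value` entry in the metadata above is enough to derive the share of each relevance level. A minimal sketch using only the four counts shown; the helper name `percentages` is hypothetical:

```python
import json

# The relevance counts copied from the metadata above.
metadata = json.loads(
    '{"counts_by_value": {"1": 2491, "2": 2068, "0": 21624, "3": 479}}'
)


def percentages(counts_by_value):
    """Convert {level: count} into {int(level): percent rounded to one decimal}."""
    total = sum(counts_by_value.values())
    return {
        int(level): round(100.0 * n / total, 1)
        for level, n in sorted(counts_by_value.items())
    }


shares = percentages(metadata["counts_by_value"])
print(shares)
```

The counts sum to 26662, which matches the `qrels` count in the same metadata block.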

`"neuclir/1/multi"`

A combined corpus of NeuCLIR v1 including all Persian, Russian, and Chinese documents.

docs

10M docs

Language: multiple/other/unknown

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/multi")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/multi docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.multi')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

{
  "docs": {
    "count": 10038768,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  }
}

`"neuclir/1/multi/trec-2023"`

Topics and assessments for the TREC NeuCLIR 2023 multi-language retrieval task.

76 queries

Language: multiple/other/unknown

Query type:

ExctractedCCMultiMtQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str
fa_mt_title: str
fa_mt_description: str
fa_mt_narrative: str
ru_mt_title: str
ru_mt_description: str
ru_mt_narrative: str
zh_mt_title: str
zh_mt_description: str
zh_mt_narrative: str
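The multi-language query type repeats each machine-translated field once per language, using the prefixes `fa_mt_`, `ru_mt_`, and `zh_mt_`. A small sketch of gathering one field across languages, assuming a stand-in tuple with a subset of those fields; the helper name `mt_field_by_language` is hypothetical:

```python
from collections import namedtuple

# Stand-in with a subset of the multi-language query fields listed above.
MultiQuery = namedtuple(
    "MultiQuery", ["query_id", "title", "fa_mt_title", "ru_mt_title", "zh_mt_title"]
)


def mt_field_by_language(query, field="title", languages=("fa", "ru", "zh")):
    """Return {language: value} for one machine-translated field."""
    return {lang: getattr(query, f"{lang}_mt_{field}") for lang in languages}


# Hypothetical query; real ones come from dataset.queries_iter().
q = MultiQuery("1", "english title", "fa text", "ru text", "zh text")
titles = mt_field_by_language(q)
print(titles)
```

Passing `field="description"` or `field="narrative"` would follow the same naming pattern on the full query type.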

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/multi/trec-2023")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative, fa_mt_title, fa_mt_description, fa_mt_narrative, ru_mt_title, ru_mt_description, ru_mt_narrative, zh_mt_title, zh_mt_description, zh_mt_narrative>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/multi/trec-2023 queries



[query_id]    [title]    [description]    [narrative]    [fa_mt_title]    [fa_mt_description]    [fa_mt_narrative]    [ru_mt_title]    [ru_mt_description]    [ru_mt_narrative]    [zh_mt_title]    [zh_mt_description]    [zh_mt_narrative]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.neuclir.1.multi.trec-2023.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

10M docs

Inherits docs from neuclir/1/multi

Language: multiple/other/unknown

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/multi/trec-2023")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/multi/trec-2023 docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.multi.trec-2023')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

80K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.	`66K`	82.7%
1	Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.	`6.0K`	7.6%
2	(No definition listed.)	`7.2K`	9.0%
3	Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.	`635`	0.8%
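Metrics that expect binary judgments need the graded levels above collapsed by a threshold. The sketch below groups relevant documents per query; the default threshold of 1 is an assumption chosen for illustration, not something the collection prescribes, and the helper name `relevant_docs` is hypothetical:

```python
from collections import defaultdict, namedtuple

# Stand-in for the TrecQrel namedtuple described above.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def relevant_docs(qrels, min_relevance=1):
    """Map each query_id to the set of doc_ids judged at or above min_relevance."""
    relevant = defaultdict(set)
    for q in qrels:
        if q.relevance >= min_relevance:
            relevant[q.query_id].add(q.doc_id)
    return dict(relevant)


# Hypothetical sample; with ir_datasets installed, pass dataset.qrels_iter() instead.
sample = [
    TrecQrel("q1", "d1", 0, "0"),
    TrecQrel("q1", "d2", 1, "0"),
    TrecQrel("q1", "d3", 3, "0"),
    TrecQrel("q2", "d4", 0, "0"),
]
print(relevant_docs(sample))
print(relevant_docs(sample, min_relevance=3))
```

Queries with no document at or above the threshold, such as `q2` here, are absent from the result rather than mapped to an empty set.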

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/multi/trec-2023")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/multi/trec-2023 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.neuclir.1.multi.trec-2023.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

{
  "docs": {
    "count": 10038768,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 76
  },
  "qrels": {
    "count": 79934,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "1": 6041,
          "2": 7171,
          "0": 66087,
          "3": 635
        }
      }
    }
  }
}

`"neuclir/1/ru"`

The Russian collection contains English queries (to be released) and Russian documents for retrieval. Human- and machine-translated queries will be provided in the query object for running monolingual retrieval, or cross-language retrieval assuming the machine translation of the query into Russian is available.

docs

4.6M docs

Language: ru

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/ru")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/ru docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.ru')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

{
  "docs": {
    "count": 4627543,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  }
}

`"neuclir/1/ru/hc4-filtered"`

Subset of the Russian collection that intersects with HC4. The 54 queries are the hc4/ru/dev and hc4/ru/test sets combined.

54 queries

Language: multiple/other/unknown

Query type:

ExctractedCCQuery: (namedtuple)

query_id: str
title: str
description: str
ht_title: str
ht_description: str
mt_title: str
mt_description: str
narrative_by_relevance: Dict[str,str]
report: str
report_url: str
report_date: str
translation_lang: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/ru/hc4-filtered")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/ru/hc4-filtered queries



[query_id]    [title]    [description]    [ht_title]    [ht_description]    [mt_title]    [mt_description]    [narrative_by_relevance]    [report]    [report_url]    [report_date]    [translation_lang]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.neuclir.1.ru.hc4-filtered.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

965K docs

Language: ru

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/ru/hc4-filtered")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/ru/hc4-filtered docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.ru.hc4-filtered')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

3.2K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.	`2.5K`	76.8%
1	Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.	`478`	14.8%
3	Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.	`274`	8.5%

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/ru/hc4-filtered")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/ru/hc4-filtered qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.neuclir.1.ru.hc4-filtered.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

Citation

ir_datasets.bib:

\cite{Lawrie2022HC4}

Bibtex:

@article{Lawrie2022HC4,
  author = {Dawn Lawrie and James Mayfield and Douglas W. Oard and Eugene Yang},
  title = {HC4: A New Suite of Test Collections for Ad Hoc CLIR},
  booktitle = {{Advances in Information Retrieval. 44th European Conference on IR Research (ECIR 2022)}},
  year = {2022},
  month = apr,
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  site = {Stavanger, Norway},
  url = {https://arxiv.org/abs/2201.09992}
}

{
  "docs": {
    "count": 964719,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 54
  },
  "qrels": {
    "count": 3235,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 2483,
          "1": 478,
          "3": 274
        }
      }
    }
  }
}

`"neuclir/1/ru/trec-2022"`

Topics and assessments for the TREC NeuCLIR 2022 (Russian language CLIR).

45 queries

Language: multiple/other/unknown

Query type:

ExctractedCCNoReportQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str
ht_title: str
ht_description: str
ht_narrative: str
mt_title: str
mt_description: str
mt_narrative: str
translation_lang: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/ru/trec-2022")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative, ht_title, ht_description, ht_narrative, mt_title, mt_description, mt_narrative, translation_lang>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/ru/trec-2022 queries



[query_id]    [title]    [description]    [narrative]    [ht_title]    [ht_description]    [ht_narrative]    [mt_title]    [mt_description]    [mt_narrative]    [translation_lang]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.neuclir.1.ru.trec-2022.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

4.6M docs

Inherits docs from neuclir/1/ru

Language: ru

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/ru/trec-2022")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/ru/trec-2022 docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.ru.trec-2022')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

33K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.	`31K`	94.3%
1	Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.	`1.1K`	3.3%
3	Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.	`810`	2.5%

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/ru/trec-2022")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/ru/trec-2022 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.neuclir.1.ru.trec-2022.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.

{
  "docs": {
    "count": 4627543,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 45
  },
  "qrels": {
    "count": 33006,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "3": 810,
          "0": 31117,
          "1": 1079
        }
      }
    }
  }
}

`"neuclir/1/ru/trec-2023"`

Topics and assessments for the TREC NeuCLIR 2023 (Russian language CLIR).

76 queries

Language: multiple/other/unknown

Query type:

ExctractedCCNoReportNoHtNarQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str
ht_title: str
ht_description: str
mt_title: str
mt_description: str
mt_narrative: str
translation_lang: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/ru/trec-2023")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative, ht_title, ht_description, mt_title, mt_description, mt_narrative, translation_lang>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/ru/trec-2023 queries



[query_id]    [title]    [description]    [narrative]    [ht_title]    [ht_description]    [mt_title]    [mt_description]    [mt_narrative]    [translation_lang]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.neuclir.1.ru.trec-2023.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

4.6M docs

Inherits docs from neuclir/1/ru

Language: ru

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/ru/trec-2023")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/ru/trec-2023 docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.ru.trec-2023')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

26K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.	`21K`	81.6%
1	Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.	`1.4K`	5.5%
2	(No definition listed.)	`3.2K`	12.4%
3	Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.	`117`	0.5%

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/ru/trec-2023")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/ru/trec-2023 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.neuclir.1.ru.trec-2023.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
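The TSV output of the qrels export command above can be read back with Python's standard csv module. The following is a minimal sketch, assuming the export writes one tab-separated row per judgment, in the column order shown above and without a header row. `read_qrels_tsv` and the local `Qrel` class are hypothetical helpers, not part of ir_datasets, and the sample rows are made up.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class Qrel(NamedTuple):
    """Local stand-in mirroring the TrecQrel fields listed above."""
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def read_qrels_tsv(handle: TextIO) -> Iterator[Qrel]:
    """Yield one Qrel per non-empty tab-separated row."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        query_id, doc_id, relevance, iteration = row
        yield Qrel(query_id, doc_id, int(relevance), iteration)


# Made-up rows standing in for the output of the export command.
sample = io.StringIO("200\tdoc-a\t3\t0\n200\tdoc-b\t0\t0\n")
qrels = list(read_qrels_tsv(sample))
print(qrels[0].relevance)
```

To read a real export, redirect the command's output to a file and pass the opened file handle instead of the `io.StringIO` sample.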

{
  "docs": {
    "count": 4627543,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 76
  },
  "qrels": {
    "count": 25634,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 20905,
          "2": 3190,
          "3": 117,
          "1": 1422
        }
      }
    }
  }
}
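The percentages in the relevance table can be recomputed from `counts_by_value` in the metadata above. This short sketch uses only the numbers shown there; `percentages` is an illustrative helper name.

```python
# Counts copied from the metadata block above.
counts_by_value = {"0": 20905, "2": 3190, "3": 117, "1": 1422}
total_qrels = 25634


def percentages(counts, total):
    """Return each level's share of all judgments, rounded to one decimal."""
    ordered = sorted(counts.items(), key=lambda item: int(item[0]))
    return {level: round(100 * n / total, 1) for level, n in ordered}


# The per-level counts should add up to the reported qrels count.
assert sum(counts_by_value.values()) == total_qrels

shares = percentages(counts_by_value, total_qrels)
for level, share in shares.items():
    print(f"relevance {level}: {share}%")
```

Level 2 accounts for 12.4% of the judgments, although no definition for it appears in the relevance table.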

`"neuclir/1/zh"`

The Chinese collection contains English queries (to be released) and Chinese documents for retrieval. Human- and machine-translated queries will be provided in the query object for running monolingual retrieval, or cross-language retrieval assuming a machine translation of the query into Chinese is available.

docs

3.2M docs

Language: zh

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/zh")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/zh docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.zh')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

{
  "docs": {
    "count": 3179209,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  }
}

`"neuclir/1/zh/hc4-filtered"`

Subset of the Chinese collection that intersects with HC4. The 60 queries are the hc4/zh/dev and hc4/zh/test sets combined.

60 queries

Language: multiple/other/unknown

Query type:

ExctractedCCQuery: (namedtuple)

query_id: str
title: str
description: str
ht_title: str
ht_description: str
mt_title: str
mt_description: str
narrative_by_relevance: Dict[str,str]
report: str
report_url: str
report_date: str
translation_lang: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/zh/hc4-filtered")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, ht_title, ht_description, mt_title, mt_description, narrative_by_relevance, report, report_url, report_date, translation_lang>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/zh/hc4-filtered queries



[query_id]    [title]    [description]    [ht_title]    [ht_description]    [mt_title]    [mt_description]    [narrative_by_relevance]    [report]    [report_url]    [report_date]    [translation_lang]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.neuclir.1.zh.hc4-filtered.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
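Each query carries its title and description in three forms: the plain fields, the `ht_` fields and the `mt_` fields. The sketch below picks one form by name. It assumes, following the collection description, that the plain fields hold the original query and that `ht_` and `mt_` hold the human and machine translations. `query_text` is a hypothetical helper, and the local `Query` class is a stand-in with only a subset of the fields listed above.

```python
from typing import NamedTuple


class Query(NamedTuple):
    """Stand-in with a subset of the query fields listed above."""
    query_id: str
    title: str
    description: str
    ht_title: str
    ht_description: str
    mt_title: str
    mt_description: str


# Maps a variant name to the field-name prefix used by the query type.
_PREFIXES = {"original": "", "human": "ht_", "machine": "mt_"}


def query_text(query, variant: str = "original", field: str = "title") -> str:
    """Return the requested field ("title" or "description") in one variant."""
    if variant not in _PREFIXES:
        raise ValueError(f"unknown variant: {variant!r}")
    return getattr(query, _PREFIXES[variant] + field)


# Made-up query for illustration.
example = Query("1", "orig title", "orig desc",
                "ht title", "ht desc", "mt title", "mt desc")
print(query_text(example, "machine", "description"))
```

With ir_datasets installed, the objects yielded by `dataset.queries_iter()` can be passed in place of the stand-in.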

docs

520K docs

Language: zh

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/zh/hc4-filtered")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/zh/hc4-filtered docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.zh.hc4-filtered')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

3.2K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.	`2.7K`	82.4%
1	Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.	`222`	6.9%
3	Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.	`344`	10.7%

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/zh/hc4-filtered")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/zh/hc4-filtered qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.neuclir.1.zh.hc4-filtered.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
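Evaluation code often wants judgments grouped by query rather than as a flat stream. The sketch below builds a nested mapping from any iterable of objects with the TrecQrel fields listed above. `qrels_to_dict` is a hypothetical helper, the `Qrel` class is a local stand-in, and the sample judgments are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Qrel(NamedTuple):
    """Local stand-in mirroring the TrecQrel fields listed above."""
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def qrels_to_dict(qrels: Iterable[Qrel]) -> Dict[str, Dict[str, int]]:
    """Group judgments as {query_id: {doc_id: relevance}}."""
    nested: Dict[str, Dict[str, int]] = defaultdict(dict)
    for qrel in qrels:
        nested[qrel.query_id][qrel.doc_id] = qrel.relevance
    return dict(nested)


# Made-up judgments for illustration.
sample = [
    Qrel("1", "doc-a", 3, "0"),
    Qrel("1", "doc-b", 0, "0"),
    Qrel("2", "doc-a", 1, "0"),
]
by_query = qrels_to_dict(sample)
print(by_query["1"])
```

With ir_datasets installed, `dataset.qrels_iter()` from the Python example above can be passed in place of `sample`.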

Citation

ir_datasets.bib:

\cite{Lawrie2022HC4}

Bibtex:

{
  "docs": {
    "count": 519945,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 60
  },
  "qrels": {
    "count": 3217,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 2651,
          "3": 344,
          "1": 222
        }
      }
    }
  }
}

`"neuclir/1/zh/trec-2022"`

Topics and assessments for the TREC NeuCLIR 2022 (Chinese language CLIR).

49 queries

Language: multiple/other/unknown

Query type:

ExctractedCCNoReportQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str
ht_title: str
ht_description: str
ht_narrative: str
mt_title: str
mt_description: str
mt_narrative: str
translation_lang: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/zh/trec-2022")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative, ht_title, ht_description, ht_narrative, mt_title, mt_description, mt_narrative, translation_lang>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/zh/trec-2022 queries



[query_id]    [title]    [description]    [narrative]    [ht_title]    [ht_description]    [ht_narrative]    [mt_title]    [mt_description]    [mt_narrative]    [translation_lang]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.neuclir.1.zh.trec-2022.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.

docs

3.2M docs

Inherits docs from neuclir/1/zh

Language: zh

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/zh/trec-2022")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/zh/trec-2022 docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.zh.trec-2022')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

37K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.	`34K`	94.2%
1	Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.	`1.4K`	3.9%
3	Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.	`720`	2.0%

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/zh/trec-2022")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/zh/trec-2022 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.neuclir.1.zh.trec-2022.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.
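Because relevance is graded, counting "relevant" documents requires choosing a cut-off. The sketch below counts, per query, the judgments at or above a threshold. The default threshold of 1 is an assumption made for illustration, not something prescribed by the track; `relevant_per_query` is a hypothetical helper and the sample judgments are made up.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Qrel(NamedTuple):
    """Local stand-in mirroring the TrecQrel fields listed above."""
    query_id: str
    doc_id: str
    relevance: int
    iteration: str


def relevant_per_query(qrels: Iterable[Qrel], min_relevance: int = 1) -> Counter:
    """Count judgments per query whose relevance is >= min_relevance."""
    counts: Counter = Counter()
    for qrel in qrels:
        if qrel.relevance >= min_relevance:
            counts[qrel.query_id] += 1
    return counts


# Made-up judgments for illustration.
sample = [
    Qrel("1", "doc-a", 3, "0"),
    Qrel("1", "doc-b", 1, "0"),
    Qrel("1", "doc-c", 0, "0"),
    Qrel("2", "doc-d", 0, "0"),
]
lenient = relevant_per_query(sample)                  # levels 1 and 3
strict = relevant_per_query(sample, min_relevance=3)  # level 3 only
print(lenient["1"], strict["1"], lenient["2"])
```

Queries with no judgment at or above the threshold are absent from the result, and a `Counter` reports 0 when they are looked up.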

{
  "docs": {
    "count": 3179209,
    "fields": {
      "doc_id": {
        "max_len": 36,
        "common_prefix": ""
      }
    }
  },
  "queries": {
    "count": 49
  },
  "qrels": {
    "count": 36575,
    "fields": {
      "relevance": {
        "counts_by_value": {
          "0": 34442,
          "3": 720,
          "1": 1413
        }
      }
    }
  }
}

`"neuclir/1/zh/trec-2023"`

Topics and assessments for the TREC NeuCLIR 2023 (Chinese language CLIR).

76 queries

Language: multiple/other/unknown

Query type:

ExctractedCCNoReportNoHtNarQuery: (namedtuple)

query_id: str
title: str
description: str
narrative: str
ht_title: str
ht_description: str
mt_title: str
mt_description: str
mt_narrative: str
translation_lang: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/zh/trec-2023")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative, ht_title, ht_description, mt_title, mt_description, mt_narrative, translation_lang>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/zh/trec-2023 queries



[query_id]    [title]    [description]    [narrative]    [ht_title]    [ht_description]    [mt_title]    [mt_description]    [mt_narrative]    [translation_lang]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
topics = prepare_dataset('irds.neuclir.1.zh.trec-2023.queries')  # AdhocTopics
for topic in topics.iter():
    print(topic)  # An AdhocTopic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocTopics.
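Unlike the trec-2022 query type, the trec-2023 query type listed above has no `ht_narrative` field. Code that handles both years can fall back to the machine translation when the human one is missing. This is a minimal sketch under that assumption; `translated_narrative` is a hypothetical helper, and the two classes are local stand-ins carrying only the narrative-related fields.

```python
from typing import NamedTuple


class Query2022(NamedTuple):
    """Stand-in for the trec-2022 query type (narrative fields only)."""
    query_id: str
    narrative: str
    ht_narrative: str
    mt_narrative: str


class Query2023(NamedTuple):
    """Stand-in for the trec-2023 query type, which lacks ht_narrative."""
    query_id: str
    narrative: str
    mt_narrative: str


def translated_narrative(query) -> str:
    """Prefer the human translation; fall back to the machine translation."""
    human = getattr(query, "ht_narrative", None)
    return human if human else query.mt_narrative


# Made-up queries for illustration.
q2022 = Query2022("1", "orig narrative", "human narrative", "machine narrative")
q2023 = Query2023("2", "orig narrative", "machine narrative")
print(translated_narrative(q2022))
print(translated_narrative(q2023))
```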

docs

3.2M docs

Inherits docs from neuclir/1/zh

Language: zh

Document type:

ExctractedCCDoc: (namedtuple)

doc_id: str
title: str
text: str
url: str
time: str
cc_file: str

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/zh/trec-2023")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, title, text, url, time, cc_file>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/zh/trec-2023 docs



[doc_id]    [title]    [text]    [url]    [time]    [cc_file]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
dataset = prepare_dataset('irds.neuclir.1.zh.trec-2023')
for doc in dataset.iter_documents():
    print(doc)  # an AdhocDocumentStore
    break

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocDocumentStore.

qrels

28K qrels

Query relevance judgment type:

TrecQrel: (namedtuple)

query_id: str
doc_id: str
relevance: int
iteration: str

Relevance levels

Rel.	Definition	Count	%
0	Not-valuable. Information in the document might be included in a report footnote, or omitted entirely.	`24K`	85.2%
1	Somewhat-valuable. The most valuable information in the document would be found in the remainder of such a report.	`2.1K`	7.7%
3	Very-valuable. Information in the document would be found in the lead paragraph of a report that is later written on the topic.	`39`	0.1%

Examples:

import ir_datasets
dataset = ir_datasets.load("neuclir/1/zh/trec-2023")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.

CLI

ir_datasets export neuclir/1/zh/trec-2023 qrels --format tsv



[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

No example available for PyTerrier

from datamaestro import prepare_dataset
qrels = prepare_dataset('irds.neuclir.1.zh.trec-2023.qrels')  # AdhocAssessments
for topic_qrels in qrels.iter():
    print(topic_qrels)  # the assessments for one topic

This example requires that experimaestro-ir be installed. For more information about the returned object, see the documentation about AdhocAssessments.