Github: datasets/touche_image.py

ir_datasets: Touché Image Search

Index
  1. touche-image
  2. touche-image/2022-06-13
  3. touche-image/2022-06-13/touche-2022-task-3

"touche-image"

A focused crawl of 23 841 images (and their associated web pages) that serves as the document collection.

This collection is licensed under the Creative Commons Attribution 4.0 International license. Individual rights to the content still apply.

Citation

ir_datasets.bib:

\cite{Bondarenko2022Touche,Kiesel2021Image,Dimitrov2021SemEval,Yanai2007Image}

Bibtex:

@inproceedings{Bondarenko2022Touche,
  address = {Berlin Heidelberg New York},
  author = {Alexander Bondarenko and Maik Fr{\"o}be and Johannes Kiesel and Shahbaz Syed and Timon Gurcke and Meriem Beloucif and Alexander Panchenko and Chris Biemann and Benno Stein and Henning Wachsmuth and Martin Potthast and Matthias Hagen},
  booktitle = {Experimental IR Meets Multilinguality, Multimodality, and Interaction. 13th International Conference of the CLEF Association (CLEF 2022)},
  editor = {Alberto Barr{\'o}n-Cede{\~n}o and Giovanni Da San Martino and Mirko Degli Esposti and Fabrizio Sebastiani and Craig Macdonald and Gabriella Pasi and Allan Hanbury and Martin Potthast and Guglielmo Faggioli and Nicola Ferro},
  month = sep,
  numpages = 29,
  publisher = {Springer},
  series = {Lecture Notes in Computer Science},
  site = {Bologna, Italy},
  title = {{Overview of Touch{\'e} 2022: Argument Retrieval}},
  year = 2022
}

@inproceedings{Kiesel2021Image,
  author = {Johannes Kiesel and Nico Reichenbach and Benno Stein and Martin Potthast},
  booktitle = {8th Workshop on Argument Mining (ArgMining 2021) at EMNLP},
  doi = {10.18653/v1/2021.argmining-1.4},
  editor = {Khalid Al-Khatib and Yufang Hou and Manfred Stede},
  month = nov,
  pages = {36-45},
  publisher = {Association for Computational Linguistics},
  site = {Punta Cana, Dominican Republic},
  title = {{Image Retrieval for Arguments Using Stance-Aware Query Expansion}},
  url = {https://aclanthology.org/2021.argmining-1.4/},
  year = 2021
}

@inproceedings{Dimitrov2021SemEval,
  author = {Dimitar Dimitrov and Bishr Bin Ali and Shaden Shaar and Firoj Alam and Fabrizio Silvestri and Hamed Firooz and Preslav Nakov and Giovanni Da San Martino},
  editor = {Alexis Palmer and Nathan Schneider and Natalie Schluter and Guy Emerson and Aur{\'{e}}lie Herbelot and Xiaodan Zhu},
  title = {SemEval-2021 Task 6: Detection of Persuasion Techniques in Texts and Images},
  booktitle = {Proceedings of the 15th International Workshop on Semantic Evaluation, SemEval@ACL/IJCNLP 2021, Virtual Event / Bangkok, Thailand, August 5-6, 2021},
  pages = {70--98},
  publisher = {Association for Computational Linguistics},
  year = {2021},
  doi = {10.18653/v1/2021.semeval-1.7},
}

@inproceedings{Yanai2007Image,
  author = {Keiji Yanai},
  editor = {Carey L. Williamson and Mary Ellen Zurko and Peter F. Patel{-}Schneider and Prashant J. Shenoy},
  title = {Image collector {III:} a web image-gathering system with bag-of-keypoints},
  booktitle = {Proceedings of the 16th International Conference on World Wide Web, {WWW} 2007, Banff, Alberta, Canada, May 8-12, 2007},
  pages = {1295--1296},
  publisher = {{ACM}},
  year = {2007},
  doi = {10.1145/1242572.1242816},
}

"touche-image/2022-06-13"

Corpus version 2022-06-13 with 23 841 images. It was released on June 13, 2022 on Zenodo.

This collection is licensed under the Creative Commons Attribution 4.0 International license. Individual rights to the content still apply.

docs
24K docs

Language: en

Document type:
ToucheImageDoc: (namedtuple)
  1. doc_id: str
  2. png: bytes
  3. webp: bytes
  4. url: str
  5. phash: str
  6. pages: List[
    ToucheImagePage: (namedtuple)
    1. page_id: str
    2. url: str
    3. rankings: List[
      ToucheImageRanking: (namedtuple)
      1. query_id: str
      2. query: str
      3. rank: int
      ]
    4. dom_html: bytes
    5. xpaths: List[str]
    6. nodes: List[
      ToucheImageNode: (namedtuple)
      1. xpath: str
      2. visible: bool
      3. id: Optional[str]
      4. classes: List[str]
      5. position: Tuple[float,float,float,float]
      6. text: Optional[str]
      7. css: Dict[str,str]
      ]
    7. text: str
    ]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("touche-image/2022-06-13")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, png, webp, url, phash, pages>

You can find more details about the Python API here.
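The nested document type above can be walked with plain loops. The following is a minimal sketch that collects, for one image, the best rank it reached for each crawl query across all of its pages. The namedtuples are local stand-ins that mirror the field lists shown above so the snippet runs without downloading the corpus; the helper name best_rank_per_query, the sample values, and the reading that a smaller rank means an earlier result position are assumptions, not part of ir_datasets.

```python
from collections import namedtuple

# Local stand-ins mirroring the documented field lists (not imported from ir_datasets).
ToucheImageRanking = namedtuple("ToucheImageRanking", ["query_id", "query", "rank"])
ToucheImagePage = namedtuple(
    "ToucheImagePage",
    ["page_id", "url", "rankings", "dom_html", "xpaths", "nodes", "text"],
)
ToucheImageDoc = namedtuple(
    "ToucheImageDoc", ["doc_id", "png", "webp", "url", "phash", "pages"]
)


def best_rank_per_query(doc):
    """Map each query_id to the smallest rank found on any page of one image.

    Assumption: a smaller rank value means an earlier result position.
    """
    best = {}
    for page in doc.pages:
        for ranking in page.rankings:
            current = best.get(ranking.query_id)
            if current is None or ranking.rank < current:
                best[ranking.query_id] = ranking.rank
    return best


# Hypothetical sample data, for illustration only.
example_doc = ToucheImageDoc(
    doc_id="example-image",
    png=b"",
    webp=b"",
    url="https://example.org/image.png",
    phash="",
    pages=[
        ToucheImagePage(
            "example-page-1", "https://example.org/a",
            [ToucheImageRanking("q1", "example query one", 7),
             ToucheImageRanking("q2", "example query two", 5)],
            b"", [], [], "",
        ),
        ToucheImagePage(
            "example-page-2", "https://example.org/b",
            [ToucheImageRanking("q1", "example query one", 2)],
            b"", [], [], "",
        ),
    ],
)

print(best_rank_per_query(example_doc))
```

With the real corpus, the same helper could be applied to each doc yielded by dataset.docs_iter().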

CLI
ir_datasets export touche-image/2022-06-13 docs
[doc_id]    [png]    [webp]    [url]    [phash]    [pages]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:touche-image/2022-06-13')
# Index touche-image/2022-06-13
indexer = pt.IterDictIndexer('./indices/touche-image_2022-06-13')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'phash'])

You can find more details about PyTerrier indexing here.


"touche-image/2022-06-13/touche-2022-task-3"

Decision-making processes, be it at the societal or at the personal level, often come to a point where one side challenges the other with a why-question, which is a prompt to justify some stance based on arguments. Since technologies for argument mining are maturing at a rapid pace, ad-hoc argument retrieval is also becoming a feasible task. Touché 2022 is the third lab on argument retrieval at CLEF, featuring three tasks.

Given a controversial topic, the task is to retrieve images (from touche-image/2022-06-13) for each stance (pro/con) that show support for that stance.

Systems are evaluated on Touché topics 1-50 by the ratio of images, among the 20 retrieved for each topic (10 per stance), that meet all three criteria: relevant to the topic, argumentative, and showing the associated stance.
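The ratio described above can be sketched for a single topic as follows. This is an illustration under stated assumptions, not the official evaluation script: the function name topic_score, the argument names, and the image ids are hypothetical, the sets good_pro and good_con stand for images already judged to meet all three criteria for the respective stance, and the denominator is fixed at 20 even when a system returns fewer images.

```python
def topic_score(pro_ids, con_ids, good_pro, good_con, per_stance=10):
    """Ratio of acceptable images among the retrieved images of one topic.

    pro_ids / con_ids: ranked image ids retrieved for the pro / con stance.
    good_pro / good_con: sets of image ids judged relevant, argumentative,
    and of the matching stance (hypothetical inputs).
    Assumption: missing results count as misses, so the denominator is
    always 2 * per_stance.
    """
    pro_hits = sum(1 for image_id in pro_ids[:per_stance] if image_id in good_pro)
    con_hits = sum(1 for image_id in con_ids[:per_stance] if image_id in good_con)
    return (pro_hits + con_hits) / (2 * per_stance)


# Hypothetical ids: 2 acceptable pro images and 1 acceptable con image.
score = topic_score(
    pro_ids=["img-a", "img-b", "img-c"],
    con_ids=["img-d", "img-e"],
    good_pro={"img-a", "img-c"},
    good_con={"img-e"},
)
print(score)
```

Averaging this value over the 50 topics would give one system-level number; whether the lab averages in exactly this way is not stated here.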

queries
50 queries

Language: en

Query type:
ToucheQuery: (namedtuple)
  1. query_id: str
  2. title: str
  3. description: str
  4. narrative: str

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("touche-image/2022-06-13/touche-2022-task-3")
for query in dataset.queries_iter():
    query # namedtuple<query_id, title, description, narrative>

You can find more details about the Python API here.

CLI
ir_datasets export touche-image/2022-06-13/touche-2022-task-3 queries
[query_id]    [title]    [description]    [narrative]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:touche-image/2022-06-13/touche-2022-task-3')
index_ref = pt.IndexRef.of('./indices/touche-image_2022-06-13') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pipeline(dataset.get_topics('title'))

You can find more details about PyTerrier retrieval here.

docs
24K docs

Inherits docs from touche-image/2022-06-13

Language: en

Document type:
ToucheImageDoc: (namedtuple)
  1. doc_id: str
  2. png: bytes
  3. webp: bytes
  4. url: str
  5. phash: str
  6. pages: List[
    ToucheImagePage: (namedtuple)
    1. page_id: str
    2. url: str
    3. rankings: List[
      ToucheImageRanking: (namedtuple)
      1. query_id: str
      2. query: str
      3. rank: int
      ]
    4. dom_html: bytes
    5. xpaths: List[str]
    6. nodes: List[
      ToucheImageNode: (namedtuple)
      1. xpath: str
      2. visible: bool
      3. id: Optional[str]
      4. classes: List[str]
      5. position: Tuple[float,float,float,float]
      6. text: Optional[str]
      7. css: Dict[str,str]
      ]
    7. text: str
    ]

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("touche-image/2022-06-13/touche-2022-task-3")
for doc in dataset.docs_iter():
    doc # namedtuple<doc_id, png, webp, url, phash, pages>

You can find more details about the Python API here.
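Each page carries a list of DOM nodes with a visible flag and an optional text. A minimal sketch for pulling the visible, non-empty text snippets from one page is shown below. The namedtuple is a local stand-in mirroring the documented ToucheImageNode fields, the page is a simple stand-in object, and the helper name visible_texts and the sample values are hypothetical.

```python
from collections import namedtuple
from types import SimpleNamespace

# Local stand-in mirroring the documented ToucheImageNode fields.
ToucheImageNode = namedtuple(
    "ToucheImageNode",
    ["xpath", "visible", "id", "classes", "position", "text", "css"],
)


def visible_texts(page):
    """Return stripped text of nodes that are visible and have non-blank text."""
    return [
        node.text.strip()
        for node in page.nodes
        if node.visible and node.text and node.text.strip()
    ]


# Hypothetical page with three nodes; only the first one qualifies.
example_page = SimpleNamespace(
    nodes=[
        ToucheImageNode("/html/body/h1", True, None, [], (0.0, 0.0, 100.0, 20.0), " Headline ", {}),
        ToucheImageNode("/html/body/div", False, "hidden", [], (0.0, 0.0, 0.0, 0.0), "Hidden text", {}),
        ToucheImageNode("/html/body/img", True, None, [], (0.0, 20.0, 100.0, 80.0), None, {}),
    ]
)

print(visible_texts(example_page))
```

Because text is Optional[str], the check for a missing value comes before any string method is called.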

CLI
ir_datasets export touche-image/2022-06-13/touche-2022-task-3 docs
[doc_id]    [png]    [webp]    [url]    [phash]    [pages]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
pt.init()
dataset = pt.get_dataset('irds:touche-image/2022-06-13/touche-2022-task-3')
# Index touche-image/2022-06-13
indexer = pt.IterDictIndexer('./indices/touche-image_2022-06-13')
index_ref = indexer.index(dataset.get_corpus_iter(), fields=['url', 'phash'])

You can find more details about PyTerrier indexing here.

qrels
20K qrels
Query relevance judgment type:
TrecQrel: (namedtuple)
  1. query_id: str
  2. doc_id: str
  3. relevance: int
  4. iteration: str

Relevance levels

Rel.    Definition      Count    %
0       not relevant    11K      55.9%
1       relevant        8.7K     44.1%

Examples:

Python API
import ir_datasets
dataset = ir_datasets.load("touche-image/2022-06-13/touche-2022-task-3")
for qrel in dataset.qrels_iter():
    qrel # namedtuple<query_id, doc_id, relevance, iteration>

You can find more details about the Python API here.
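The counts and percentages in the relevance table can be recomputed from the qrels themselves. The following sketch tallies judgments per relevance level. It uses a local stand-in namedtuple with the documented TrecQrel fields and a few made-up judgments so that it runs offline; the helper name relevance_distribution and the sample ids are hypothetical.

```python
from collections import Counter, namedtuple

# Local stand-in mirroring the documented TrecQrel fields.
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])


def relevance_distribution(qrels):
    """Return {relevance level: (count, percentage of all judgments)}."""
    counts = Counter(qrel.relevance for qrel in qrels)
    total = sum(counts.values())
    return {
        level: (count, 100.0 * count / total)
        for level, count in sorted(counts.items())
    }


# Made-up judgments, for illustration only.
toy_qrels = [
    TrecQrel("1", "img-a", 0, "0"),
    TrecQrel("1", "img-b", 1, "0"),
    TrecQrel("2", "img-c", 0, "0"),
    TrecQrel("2", "img-d", 0, "0"),
]

print(relevance_distribution(toy_qrels))
```

Passing dataset.qrels_iter() instead of toy_qrels would apply the same tally to the real judgments.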

CLI
ir_datasets export touche-image/2022-06-13/touche-2022-task-3 qrels --format tsv
[query_id]    [doc_id]    [relevance]    [iteration]
...

You can find more details about the CLI here.

PyTerrier
import pyterrier as pt
from pyterrier.measures import *
pt.init()
dataset = pt.get_dataset('irds:touche-image/2022-06-13/touche-2022-task-3')
index_ref = pt.IndexRef.of('./indices/touche-image_2022-06-13') # assumes you have already built an index
pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
# (optionally other pipeline components)
pt.Experiment(
    [pipeline],
    dataset.get_topics('title'),
    dataset.get_qrels(),
    [MAP, nDCG@20]
)

You can find more details about PyTerrier experiments here.
