ECCT: Efficient Contrastive Clustering via Pseudo-Siamese Vision Transformer and Multi-view Augmentation

Bibliographic Details
Published in: Neural Networks, Vol. 180, Art. no. 106684
Authors: Wei, Xing; Hu, Taizhang; Wu, Di; Yang, Fan; Zhao, Chong; Lu, Yang
Format: Journal Article
Language: English
Published: United States: Elsevier Ltd, 01.12.2024
ISSN: 0893-6080, 1879-2782
Online access: Full text
Description
Abstract: Image clustering aims to divide a set of unlabeled images into multiple clusters. Recently, clustering methods based on contrastive learning have attracted much attention due to their ability to learn discriminative feature representations. Nevertheless, existing clustering algorithms struggle to capture global information and to preserve semantic continuity. Additionally, these methods often produce relatively narrow feature distributions, which limits the full potential of contrastive learning in clustering. These problems can degrade image clustering performance. To address them, we propose a deep clustering framework termed Efficient Contrastive Clustering via Pseudo-Siamese Vision Transformer and Multi-view Augmentation (ECCT). The core idea is to introduce a Vision Transformer (ViT) to provide the global view, and to improve it with a Hilbert Patch Embedding (HPE) module that forms a second ViT branch. The features extracted from the two ViT branches are then fused to obtain both the global view and semantic coherence. In addition, we employ multi-view random aggressive augmentation to broaden the feature distribution, enabling the model to learn more comprehensive and richer contrastive features. Our results on five datasets demonstrate that ECCT outperforms previous clustering methods. In particular, the ARI of ECCT on the STL-10 (ImageNet-Dogs) dataset is 0.852 (0.424), which is 10.3% (4.8%) higher than the best baseline.
Highlights:
• An efficient image clustering method based on contrastive learning is proposed.
• Multi-view augmentation is designed to expand the feature-space distribution.
• Global view and semantic coherence in images are considered simultaneously.
• The Vision Transformer is introduced into image clustering to provide the global view.
• The Hilbert Patch Embedding module is designed to address semantic coherence.
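The abstract describes the HPE module only at a high level. As a hedged illustration (not the paper's implementation), a common way to preserve semantic continuity when flattening an image into a patch sequence is to visit the patch grid in Hilbert-curve order instead of row-major order, so that patches adjacent in the sequence are also spatial neighbors in the image. The sketch below, in plain Python with assumed function names and grid size, computes such an ordering:

```python
def hilbert_d2xy(order, d):
    """Map distance d along a Hilbert curve over a 2**order x 2**order grid to (x, y)."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate/reflect the quadrant so the sub-curves connect
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_patch_order(order):
    """Indices that reorder a row-major patch sequence into Hilbert order."""
    n = 1 << order
    return [y * n + x for x, y in (hilbert_d2xy(order, d) for d in range(n * n))]

# Example: a 4x4 patch grid (order 2). Consecutive indices in `seq`
# always refer to spatially adjacent patches in the image.
seq = hilbert_patch_order(2)
```

In a ViT branch built this way, the row-major patch sequence would simply be gathered with these indices before positional embedding; how ECCT actually integrates the ordering is not specified in the abstract.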
DOI: 10.1016/j.neunet.2024.106684