ECCT: Efficient Contrastive Clustering via Pseudo-Siamese Vision Transformer and Multi-view Augmentation

Bibliographic details
Published in: Neural Networks, Vol. 180, p. 106684
Main authors: Wei, Xing; Hu, Taizhang; Wu, Di; Yang, Fan; Zhao, Chong; Lu, Yang
Format: Journal Article
Language: English
Publication details: United States, Elsevier Ltd, 1 December 2024
ISSN: 0893-6080, 1879-2782
Description
Summary: Image clustering aims to divide a set of unlabeled images into multiple clusters. Recently, clustering methods based on contrastive learning have attracted much attention due to their ability to learn discriminative feature representations. Nevertheless, existing clustering algorithms face challenges in capturing global information and preserving semantic continuity. Additionally, these methods often exhibit relatively singular feature distributions, limiting the full potential of contrastive learning in clustering. These problems can degrade the performance of image clustering. To address them, we propose a deep clustering framework termed Efficient Contrastive Clustering via Pseudo-Siamese Vision Transformer and Multi-view Augmentation (ECCT). The core idea is to introduce a Vision Transformer (ViT) to provide the global view, and to improve it with a Hilbert Patch Embedding (HPE) module to construct a new ViT branch. Finally, we fuse the features extracted from the two ViT branches to obtain both a global view and semantic coherence. In addition, we employ multi-view random aggressive augmentation to broaden the feature distribution, enabling the model to learn more comprehensive and richer contrastive features. Our results on five datasets demonstrate that ECCT outperforms previous clustering methods. In particular, the ARI metric of ECCT on the STL-10 (ImageNet-Dogs) dataset is 0.852 (0.424), which is 10.3% (4.8%) higher than the best baseline.
• An efficient image clustering method based on contrastive learning is proposed.
• Multi-view augmentation is designed to expand the feature space distribution.
• Global view and semantic coherence in images are considered simultaneously.
• Vision Transformer is introduced into image clustering to provide a global view.
• A Hilbert Patch Embedding module is designed to address semantic coherence.
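The summary does not say how the Hilbert Patch Embedding module arranges image patches. As a minimal sketch, assuming it sequences patches along a Hilbert space-filling curve instead of the usual row-by-row (raster) order, the Python below computes such an ordering for a square patch grid whose side is a power of two. The function names `hilbert_d2xy` and `hilbert_patch_order` are hypothetical and are not taken from the paper; the curve construction is the standard distance-to-coordinate conversion, not the authors' implementation.

```python
def hilbert_d2xy(n, d):
    """Map distance d along a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two; 0 <= d < n * n.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the current sub-square so the curve stays connected.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_patch_order(n):
    """Return raster indices (row * n + col) of an n x n patch grid,
    listed in the order a Hilbert curve visits them (assumed ordering)."""
    if n < 1 or n & (n - 1):
        raise ValueError("grid side must be a power of two")
    order = []
    for d in range(n * n):
        x, y = hilbert_d2xy(n, d)  # x = column, y = row
        order.append(y * n + x)
    return order


if __name__ == "__main__":
    # Reorder a raster-ordered list of patches (here just labels) for a 4 x 4 grid.
    patches = [f"p{i}" for i in range(16)]
    reordered = [patches[i] for i in hilbert_patch_order(4)]
    print(reordered)
```

Under this assumption, consecutive tokens in the reordered sequence always come from spatially adjacent patches, whereas raster order jumps from the end of one row to the start of the next. That locality is one plausible reading of the "semantic coherence" the summary attributes to the HPE branch.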
DOI: 10.1016/j.neunet.2024.106684