ECCT: Efficient Contrastive Clustering via Pseudo-Siamese Vision Transformer and Multi-view Augmentation
| Published in: | Neural Networks, Volume 180, p. 106684 |
|---|---|
| Main authors: | , , , , , |
| Medium: | Journal Article |
| Language: | English |
| Publication details: | United States: Elsevier Ltd, 01.12.2024 |
| ISSN: | 0893-6080, 1879-2782 |
| Summary: | Image clustering aims to divide a set of unlabeled images into multiple clusters. Recently, clustering methods based on contrastive learning have attracted much attention for their ability to learn discriminative feature representations. Nevertheless, existing clustering algorithms struggle to capture global information and to preserve semantic continuity. In addition, these methods often produce relatively narrow feature distributions, limiting the full potential of contrastive learning for clustering. Both problems degrade image-clustering performance. To address them, we propose a deep clustering framework termed Efficient Contrastive Clustering via Pseudo-Siamese Vision Transformer and Multi-view Augmentation (ECCT). The core idea is to introduce a Vision Transformer (ViT) to provide the global view, and to improve it with a Hilbert Patch Embedding (HPE) module that forms a second ViT branch. The features extracted by the two ViT branches are then fused to obtain both the global view and semantic coherence. In addition, we employ multi-view random aggressive augmentation to broaden the feature distribution, enabling the model to learn more comprehensive and richer contrastive features. Results on five datasets demonstrate that ECCT outperforms previous clustering methods. In particular, the ARI of ECCT on the STL-10 (ImageNet-Dogs) dataset is 0.852 (0.424), which is 10.3% (4.8%) higher than the best baseline. |
|---|---|
| Highlights: | • An efficient image clustering method based on contrastive learning is proposed. • Multi-view augmentation is designed to expand the feature-space distribution. • Global view and semantic coherence in images are considered simultaneously. • The Vision Transformer is introduced into image clustering to provide the global view. • A Hilbert Patch Embedding module is designed to address semantic coherence. |
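The abstract does not detail how the Hilbert Patch Embedding module works, but the underlying idea of a Hilbert-curve patch ordering can be sketched: instead of feeding ViT patches in row-major order, the patches are visited along a Hilbert space-filling curve, so that consecutive tokens in the sequence remain spatially adjacent in the image. The following is a minimal, hypothetical illustration under that assumption; the function names (`d2xy`, `hilbert_patch_order`) are mine, not from the paper, and the actual HPE module may differ.

```python
def rot(n, x, y, rx, ry):
    """Rotate/flip a quadrant appropriately (standard Hilbert-curve helper)."""
    if ry == 0:
        if rx == 1:
            x = n - 1 - x
            y = n - 1 - y
        x, y = y, x
    return x, y


def d2xy(n, d):
    """Convert distance d along the Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two (which holds for typical ViT patch grids,
    e.g. 14 would need padding to 16, or 8/16/32 directly).
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        x, y = rot(s, x, y, rx, ry)
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_patch_order(grid):
    """Row-major patch indices reordered along a Hilbert curve.

    The returned permutation can be used to reindex a ViT patch-token
    sequence so that neighboring tokens stay spatially adjacent.
    """
    order = []
    for d in range(grid * grid):
        x, y = d2xy(grid, d)
        order.append(y * grid + x)  # x = column, y = row
    return order
```

A useful property to note: consecutive entries of `hilbert_patch_order(g)` always differ by exactly one step on the patch grid, which is precisely the semantic-continuity argument for this ordering over the default row-major scan, where jumps occur at the end of every row.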
| DOI: | 10.1016/j.neunet.2024.106684 |
|---|---|