A cosine similarity-based token subsampling method for vision transformer in cloud computing

Bibliographic Details
Published in:	Neural computing & applications Vol. 37; no. 4; pp. 2627-2639
Main Authors:	Li, Qi, Kaneko, Hayata, Meng, Lin
Format:	Journal Article
Language:	English
Published:	London: Springer London, 01.02.2025; Springer Nature B.V.
Subjects:	Algorithms; Artificial Intelligence; Centroids; Cloud computing; Computational Biology/Bioinformatics; Computational Science and Engineering; Computer Science; Cosine similarity; Data Mining and Knowledge Discovery; Deep learning; Edge computing; Image Processing and Computer Vision; Model partitioning; Original Article; Partitioning; Probability and Statistics in Computer Science; Similarity; Token clustering; Transmission efficiency; Vision; Vision transformer; Visual tasks
ISSN:	0941-0643, 1433-3058

Description
Summary:	Deploying huge deep learning applications on resource-constrained edge devices is a challenging task, and cloud-based edge computing is a promising solution. One such approach is model partitioning: a portion of the deep learning model is deployed on the edge device, while the remaining portion is executed in the cloud. Leveraging the computation power of edge devices reduces transmission latency and increases bandwidth efficiency. Recently, visual transformer models, supported by large datasets, have dominated multiple vision tasks. However, model partitioning optimization methods for visual transformers are lacking. Therefore, the paper proposes a cosine similarity-based token subsampling method for visual transformer model partitioning to improve transmission efficiency. Tokens in the same class are subsampled, and only the centroid tokens are uploaded. In the cloud, all tokens are reconstructed based on interpolation indexes. Three algorithm implementations are proposed and measured on a PC, a Jetson Nano, and a Cortex-A53 edge CPU. The experimental results demonstrate that the recommended algorithm implementation executes with a low latency of 71.24 ms and reduces transmitted data by 35.65% with an accuracy drop of 0.46%.
DOI:	10.1007/s00521-024-10718-w
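
The summary describes grouping similar tokens on the edge device, uploading only the centroid tokens, and rebuilding the full token sequence in the cloud from interpolation indexes. The record does not specify the paper's three implementations, so the following is only a minimal sketch of that general idea. The greedy grouping rule, the `threshold` parameter, and the function names `subsample_tokens` and `reconstruct_tokens` are assumptions made for illustration, not the authors' method.

```python
import numpy as np


def subsample_tokens(tokens, threshold=0.9):
    """Edge side: group tokens by cosine similarity (hypothetical greedy rule).

    tokens: non-empty array of shape (num_tokens, dim).
    Returns (centroids, indexes), where indexes[i] is the group of token i.
    """
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.maximum(norms, 1e-12)  # unit vectors; guard against zero norm

    reps = []  # unit-normalised first member of each group
    indexes = np.empty(len(tokens), dtype=np.int64)
    for i, u in enumerate(unit):
        if reps:
            sims = np.stack(reps) @ u  # cosine similarity to each group's first member
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                indexes[i] = best
                continue
        reps.append(u)  # no sufficiently similar group: start a new one
        indexes[i] = len(reps) - 1

    # Centroid token of each group is the mean of its members.
    centroids = np.stack(
        [tokens[indexes == k].mean(axis=0) for k in range(len(reps))]
    )
    return centroids, indexes


def reconstruct_tokens(centroids, indexes):
    """Cloud side: rebuild the full sequence by copying each token's centroid."""
    return centroids[indexes]
```

In this sketch only `centroids` and the integer `indexes` array would be transmitted, so the payload shrinks whenever several tokens share a group. A lower `threshold` merges more tokens and sends less data, at the cost of a coarser reconstruction.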