Leveraging large visual models for enhanced object detection: An improved SAM-YOLOv5 model

Bibliographic Details
Published in:	Knowledge-based systems Vol. 330; p. 114757
Main Authors:	Tang, Jun, Li, Dan, Yang, Jiawei, Chen, Jing, Yuan, Ruiping
Format:	Journal Article
Language:	English
Published:	Elsevier B.V. 25.11.2025
Subjects:	Feature fusion; I-SAM-YOLOv5; Large visual models; Object detection
ISSN:	0950-7051

Description
Summary:	Although various object detection methods have been developed, the accuracy of existing algorithms remains insufficient, particularly for small and distant objects. To address these challenges, we propose an improved object detection model, I-SAM-YOLOv5, which combines the strengths of a large vision model (SAM) and YOLOv5. The framework incorporates a large visual feature fusion (LVFF) module, which integrates SAM's powerful visual features into YOLOv5 to improve feature representation. In addition, an enhanced fixed-resolution feature pyramid network (FRFPN) refines and strengthens feature extraction. Experimental results on the COCO and KITTI datasets demonstrate considerable improvements in detection accuracy across almost all model scales (n, s, m, l, and x). For the scale-n model, our model achieves an 8.47% increase in mean average precision (mAP) on COCO and a 5.48% improvement on KITTI compared with the YOLOv5 baseline. To further assess the effectiveness of I-SAM-YOLOv5, we conduct ablation studies examining different LVFF variants, FRFPN designs, feature fusion positions, adapters, and multi-layer perceptron (MLP) configurations. The results confirm the robust performance gains of the proposed framework. This study advances object detection and extends the application of large vision models to computer vision tasks in areas such as intelligent transportation systems.
DOI:	10.1016/j.knosys.2025.114757
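
Note on the summary: the abstract says that SAM features are fused into YOLOv5 and mentions adapters, but this record does not describe how the LVFF module is built. The sketch below is therefore only a generic illustration of concatenation-based feature fusion, not the authors' method. It assumes the large-model feature map is resized to the detector's spatial size, passed through a 1x1 channel-mixing "adapter", and concatenated along the channel axis. All function names, shapes, and weights are hypothetical.

```python
# Generic feature-fusion sketch (assumed design, NOT the paper's LVFF module).
# Feature maps are nested lists shaped [channels][height][width].

def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of a C x H x W feature map."""
    in_h, in_w = len(fmap[0]), len(fmap[0][0])
    return [
        [[ch[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
         for y in range(out_h)]
        for ch in fmap
    ]

def project_channels(fmap, weights):
    """Hypothetical 1x1 adapter: mixes channels per pixel.

    weights is a C_out x C_in matrix given as a list of rows.
    """
    h, w = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(wk * fmap[k][y][x] for k, wk in enumerate(row))
          for x in range(w)]
         for y in range(h)]
        for row in weights
    ]

def fuse(det_fmap, lvm_fmap, adapter_weights):
    """Align the large-model features to the detector map, adapt, and concatenate."""
    h, w = len(det_fmap[0]), len(det_fmap[0][0])
    aligned = resize_nearest(lvm_fmap, h, w)
    adapted = project_channels(aligned, adapter_weights)
    return det_fmap + adapted  # list concatenation = channel-wise concat

# Toy inputs: a 2-channel 4x4 detector map and a 3-channel 2x2 large-model map.
det = [[[0.0] * 4 for _ in range(4)] for _ in range(2)]
lvm = [
    [[1, 2], [3, 4]],
    [[10, 20], [30, 40]],
    [[100, 200], [300, 400]],
]
fused = fuse(det, lvm, adapter_weights=[[1, 1, 1]])  # 3 -> 1 channel
print(len(fused), len(fused[0]), len(fused[0][0]))
```

Concatenation keeps the detector's own features intact and lets later layers learn how much weight to give the added channels. Element-wise addition is a common alternative when the channel counts match.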