Leveraging large visual models for enhanced object detection: An improved SAM-YOLOv5 model

Bibliographic details
Published in:	Knowledge-Based Systems, Volume 330, p. 114757
Main authors:	Tang, Jun; Li, Dan; Yang, Jiawei; Chen, Jing; Yuan, Ruiping
Format:	Journal Article
Language:	English
Published:	Elsevier B.V., 25 November 2025
Subjects:	Feature fusion; I-SAM-YOLOv5; Large visual models; Object detection
ISSN:	0950-7051

Description
Summary:	Although various object detection methods have been developed, the accuracy of existing algorithms remains insufficient, particularly for small and distant objects. To address these challenges, we propose an improved object detection model, I-SAM-YOLOv5, which combines the strengths of a large vision model, the Segment Anything Model (SAM), with those of YOLOv5. The framework incorporates a large visual feature fusion (LVFF) module, in which the powerful visual features of SAM are integrated into YOLOv5 to improve feature representation. In addition, an enhanced fixed-resolution feature pyramid network (FRFPN) is employed to refine and strengthen feature extraction. Experimental results on the COCO and KITTI datasets demonstrate considerable improvements in detection accuracy across almost all model scales (n, s, m, l, and x). At scale n, our model achieves an 8.47% increase in mean average precision (mAP) on COCO and a 5.48% improvement on KITTI compared with the YOLOv5 baseline. To further assess the effectiveness of I-SAM-YOLOv5, we conduct ablation studies examining different LVFF variants, FRFPN designs, feature fusion positions, adapters, and multi-layer perceptron (MLP) configurations. The results confirm the robust performance gains of the proposed framework. This study advances object detection and extends the use of large vision models to computer vision applications such as intelligent transportation systems.
DOI:	10.1016/j.knosys.2025.114757
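
Illustrative note: the summary says that features from a large vision model are integrated into the detector's features, but this record does not describe how the LVFF module does so. The following is only a generic sketch of one common fusion pattern (resize, project channels, add), not the authors' method. The function names, tensor shapes, and the choice of additive fusion are all assumptions made for illustration.

```python
import numpy as np


def resize_nearest(feat, out_h, out_w):
    """Nearest-neighbour resize of a (C, H, W) feature map to (C, out_h, out_w)."""
    _, h, w = feat.shape
    rows = np.arange(out_h) * h // out_h  # source row for each output row
    cols = np.arange(out_w) * w // out_w  # source column for each output column
    return feat[:, rows[:, None], cols[None, :]]


def project_channels(feat, weight):
    """1x1 convolution: mix channels per pixel. weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, feat)


def fuse(det_feat, lvm_feat, weight):
    """Additive fusion (an assumed choice): align the large-model features
    to the detector feature map, then add them element-wise."""
    _, h, w = det_feat.shape
    aligned = resize_nearest(lvm_feat, h, w)
    return det_feat + project_channels(aligned, weight)


# Hypothetical shapes and random values, chosen only to exercise the code.
rng = np.random.default_rng(0)
det_feat = rng.standard_normal((64, 80, 80))   # stand-in detector feature map
lvm_feat = rng.standard_normal((256, 64, 64))  # stand-in large-model embedding
weight = rng.standard_normal((64, 256)) * 0.01  # stand-in projection weights
fused = fuse(det_feat, lvm_feat, weight)
print(fused.shape)
```

In a trained model the projection weights would be learned, and concatenation followed by a convolution is an equally plausible alternative to addition; the published article should be consulted for the actual design.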