Citation: Lin Chen, Yungang Liu, Yuan Wang. Effective convolution mixed Transformer Siamese network for robust visual tracking [J]. Control Theory and Technology, 2025, 23(2): 221-236.
|
|
|
Effective convolution mixed Transformer Siamese network for robust visual tracking |
Lin Chen1,2, Yungang Liu1, Yuan Wang1
|
(1 School of Control Science and Engineering, Shandong University, Jinan, 250061, Shandong, China
2 School of Computer Science, Liaocheng University, Liaocheng, 252000, Shandong, China) |
|
Abstract:
Siamese tracking algorithms usually take convolutional neural networks (CNNs) as feature extractors owing to their capability of extracting deep discriminative features. However, the convolution kernels in CNNs have limited receptive fields, making it difficult to capture the global feature dependencies that are important for object detection, especially when the target undergoes large-scale variations or movement. In view of this, we develop a novel network called the effective convolution mixed Transformer Siamese network (SiamCMT) for visual tracking, which integrates CNN-based and Transformer-based architectures to capture both local information and long-range dependencies. Specifically, we design a Transformer-based module named lightweight multi-head attention (LWMHA), which can be flexibly embedded into stage-wise CNNs to improve the network's representation ability. Additionally, we introduce a stage-wise feature aggregation mechanism that integrates features learned at multiple stages. By leveraging both location and semantic information, this mechanism helps SiamCMT locate the target more accurately. Moreover, to distinguish the contributions of different channels, a channel-wise attention mechanism is introduced to enhance the important channels and suppress the others. Extensive experiments on seven challenging benchmarks, i.e., OTB2015, UAV123, GOT10K, LaSOT, DTB70, UAVTrack112_L, and VOT2018, demonstrate the effectiveness of the proposed algorithm. In particular, the proposed method outperforms the baseline by 3.5% and 3.1% in precision and success rates, respectively, with a real-time speed of 59.77 FPS on UAV123.
Key words: Visual tracking · Siamese network · Transformer · Feature aggregation · Channel-wise attention
DOI: https://doi.org/10.1007/s11768-025-00251-z
|
Funding: This work was partly supported by the National Natural Science Foundation of China (Grant No. 62033007) and the Major Fundamental Research Program of Shandong Province (Grant No. ZR2023ZD37).
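The abstract describes three components: a lightweight multi-head attention (LWMHA) block embedded into stage-wise CNNs, a stage-wise feature aggregation mechanism, and channel-wise attention. Since the full paper is not reproduced here, the PyTorch sketch below is only a plausible illustration of such modules; every class, width, and layout in it is an assumption, not the authors' actual SiamCMT implementation.

```python
# Illustrative sketch only: hypothetical stand-ins for the LWMHA block,
# channel-wise attention, and stage-wise feature aggregation that the
# abstract describes. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LWMHA(nn.Module):
    """Global self-attention over spatial positions at a reduced width,
    kept cheap enough (in spirit) to append to every CNN stage."""
    def __init__(self, channels: int, num_heads: int = 4, reduction: int = 2):
        super().__init__()
        inner = channels // reduction           # reduced width keeps the block light
        self.norm = nn.LayerNorm(channels)
        self.proj_in = nn.Linear(channels, inner)
        self.attn = nn.MultiheadAttention(inner, num_heads, batch_first=True)
        self.proj_out = nn.Linear(inner, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C): one token per position
        t = self.proj_in(self.norm(tokens))
        out, _ = self.attn(t, t, t)             # long-range dependencies across the map
        tokens = tokens + self.proj_out(out)    # residual keeps the local CNN cues
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class ChannelAttention(nn.Module):
    """SE-style gate: enhance informative channels, suppress the others."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        gate = self.fc(x.mean(dim=(2, 3)))      # global average pooling per channel
        return x * gate.view(b, c, 1, 1)

class StageAggregation(nn.Module):
    """Fuse features from several backbone stages: shallow stages keep
    location detail, deep stages carry semantics."""
    def __init__(self, in_channels: list, out_channels: int = 256):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.gate = ChannelAttention(out_channels)

    def forward(self, feats: list) -> torch.Tensor:
        size = feats[0].shape[-2:]              # resample everything to the shallowest map
        fused = sum(F.interpolate(p(f), size=size, mode='bilinear', align_corners=False)
                    for p, f in zip(self.proj, feats))
        return self.gate(fused)                 # channel-wise reweighting of the fusion

# Shape check with three hypothetical stage outputs.
feats = [torch.randn(1, 64, 32, 32), torch.randn(1, 128, 16, 16), torch.randn(1, 256, 8, 8)]
feats = [LWMHA(f.shape[1])(f) for f in feats]   # attention appended per stage
print(StageAggregation([64, 128, 256])(feats).shape)  # torch.Size([1, 256, 32, 32])
```

In a Siamese tracker, both the template branch and the search branch would share such a backbone, with the aggregated features feeding the correlation and prediction heads.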
|
|
|
|
|
|