Research Article 2026-04-21 under-review v1

Multi-Scale Semantic Alignment for Enhanced Image-Text Retrieval

Haifang Mo South-Central Minzu University
Zili Su South-Central Minzu University
Huili Zhang South-Central Minzu University
Meng Xia South-Central Minzu University
Linhui Cheng South-Central Minzu University
Qian Luo South-Central Minzu University

Abstract

Cross-modal image-text retrieval aims to match visual content with natural language descriptions, a task central to multimodal understanding. Despite advances in feature extraction and alignment, mainstream methods remain constrained by relying on either global or local matching in isolation. This paper introduces a Cross-Modal Retrieval Network with Multi-Scale Feature Enhancement and Semantic-Aware Adaptive Fusion. The proposed method integrates visual and textual representations across different hierarchies, dynamically filters redundant visual information, and strengthens the focus on regionally relevant image-text pairs. An adaptive fusion mechanism, guided by semantic complexity, balances global and local similarities. Experiments on the Flickr30K and MS-COCO datasets demonstrate superior performance, validating the effectiveness and robustness of our approach. Our code is available at https://github.com/SUZILI7/MSSA

Citation Information

@article{haifangmo2026,
  title={Multi-Scale Semantic Alignment for Enhanced Image-Text Retrieval},
  author={Haifang Mo and Zili Su and Huili Zhang and Meng Xia and Linhui Cheng and Qian Luo},
  journal={The Visual Computer},
  year={2026},
  doi={10.21203/rs.3.rs-8805622/v1}
}