Multi-Scale Semantic Alignment for Enhanced Image-Text Retrieval
Abstract
Cross-modal image-text retrieval aims to match visual content with natural language descriptions, a task central to multimodal understanding. Despite advances in feature extraction and alignment, mainstream methods rely on either global or local matching in isolation. This paper introduces a Cross-Modal Retrieval Network with Multi-Scale Feature Enhancement and Semantic-Aware Adaptive Fusion. The proposed method integrates visual and textual representations across different hierarchical levels, dynamically filters redundant visual information, and strengthens the focus on regionally relevant image-text pairs. An adaptive fusion mechanism, guided by semantic complexity, balances global and local similarities. Experiments on the Flickr30K and MS-COCO datasets demonstrate superior performance, validating the effectiveness and robustness of our approach. Our code is available at https://github.com/SUZILI7/MSSA.
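The idea of balancing global and local similarities with a complexity-guided weight can be illustrated with a minimal sketch. This is not the authors' implementation: the mean pooling, the max-over-regions local score, the sigmoid gate, the scalar `complexity` input, and the parameters `w` and `b` are all assumptions chosen for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def global_similarity(regions, words):
    """Cosine similarity between mean-pooled image and text embeddings."""
    img = l2_normalize(regions.mean(axis=0))
    txt = l2_normalize(words.mean(axis=0))
    return float(img @ txt)

def local_similarity(regions, words):
    """For each word, take its best-matching region, then average over words."""
    sim = l2_normalize(words) @ l2_normalize(regions).T  # (n_words, n_regions)
    return float(sim.max(axis=1).mean())

def fused_similarity(regions, words, complexity, w=1.0, b=0.0):
    """Blend local and global scores with a gate driven by a complexity score.

    `complexity`, `w`, and `b` are hypothetical: a higher complexity value
    shifts weight toward the local (region-word) similarity.
    """
    alpha = 1.0 / (1.0 + np.exp(-(w * complexity + b)))  # weight on local score
    g = global_similarity(regions, words)
    l = local_similarity(regions, words)
    return alpha * l + (1.0 - alpha) * g

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    regions = rng.normal(size=(5, 8))  # 5 image regions, 8-dim features
    words = rng.normal(size=(3, 8))    # 3 caption tokens, 8-dim features
    print(fused_similarity(regions, words, complexity=1.5))
```

In this sketch a complexity of zero weights the two scores equally, and larger values move the fused score toward the local term; a learned gate would replace the fixed `w` and `b`.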
Citation Information
@article{haifangmo2026,
title={Multi-Scale Semantic Alignment for Enhanced Image-Text Retrieval},
author={Haifang Mo and Zili Su and Huili Zhang and Meng Xia and Linhui Cheng and Qian Luo},
journal={The Visual Computer},
year={2026},
doi={10.21203/rs.3.rs-8805622/v1}
}