Research Article 2026-04-22 in-revision v1

LDCAP: Robust Image Captioning via Latent Compression and Dynamic Decoder Conditioning

Veerababu Reddy, Vignan’s Lara Institute of Technology and Science
Seetharam Poola, Vignan’s Lara Institute of Technology and Science
Musharaf Shaik, Vignan’s Lara Institute of Technology and Science
Sai Durga Kistaparapu, Vignan’s Lara Institute of Technology and Science
Vignesh Sajja, Vignan’s Lara Institute of Technology and Science

Abstract

Generating accurate natural language descriptions from images remains a challenging task, particularly when input images are captured under poor lighting conditions such as dim indoor environments, nighttime scenes, or strongly backlit settings. Under such conditions, region-level visual features extracted by standard object detection networks become noisy and unreliable, causing attention mechanisms to focus on uninformative image regions and ultimately degrading caption quality. This paper presents LDCAP (Latent Compression and Dynamic Decoder Conditioning for Image Captioning), a transformer-based captioning model that addresses visual feature degradation directly within the network architecture, without requiring any external image preprocessing or enhancement step. LDCAP incorporates three targeted architectural modifications over its SCAP baseline. First, the encoder is redesigned using a Recurrent Interface Network (RIN), which compresses variable-length region features into a fixed set of 64 learnable latent tokens through iterative cross-attention, forming a structured information bottleneck that naturally suppresses noise-dominated feature dimensions. Second, Feature-wise Linear Modulation (FiLM) layers are integrated into each decoder block, enabling global scene context to dynamically condition hidden representations at every caption generation step, complementing the local cross-attention mechanism. Third, a two-stage training strategy is employed, combining cross-entropy pre-training with self-critical sequence training (SCST), which directly aligns the optimisation objective with standard captioning evaluation metrics. The complete model contains 31.8M parameters, remaining compact relative to large-scale vision-language pre-trained models while achieving competitive performance. 
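As a rough illustration of the first two modifications, the following is a minimal sketch, not the authors' implementation. It assumes a single attention head with no learned projections, uses randomly initialised latents in place of learned ones, and passes the FiLM scale and shift in directly rather than predicting them from global scene context. The function names (`cross_attend`, `compress`, `film`) and the `steps` parameter are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(latents, regions):
    """One read step: every latent token queries all region features.

    Assumption: query/key/value projections are omitted, so the raw
    vectors act as queries, keys, and values. The output has one row
    per latent, independent of how many regions are supplied.
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q in latents:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in regions]
        weights = softmax(scores)
        read = [sum(w * v[d] for w, v in zip(weights, regions)) for d in range(dim)]
        # Residual update of the latent with what it read from the regions.
        updated.append([qi + ri for qi, ri in zip(q, read)])
    return updated


def compress(regions, num_latents=64, steps=2, seed=0):
    """Compress a variable-length list of region features into num_latents tokens.

    Assumption: latents are random here; in a trained model they would be
    learnable parameters. `steps` is an illustrative iteration count.
    """
    rng = random.Random(seed)
    dim = len(regions[0])
    latents = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_latents)]
    for _ in range(steps):
        latents = cross_attend(latents, regions)
    return latents


def film(hidden, gamma, beta):
    """Feature-wise linear modulation: gamma * h + beta, applied per feature."""
    return [g * h + b for h, g, b in zip(hidden, gamma, beta)]
```

In this sketch, calling `compress` on 10 regions or on 37 regions returns 64 latent vectors in both cases, which is the fixed-size bottleneck property the abstract describes.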
Experimental evaluation on the MS-COCO 2014 benchmark demonstrates that LDCAP achieves a CIDEr score of 134.2 (±0.4), improving upon the SCAP baseline of 131.7, with consistent gains across METEOR (32.9), ROUGE-L (62.8), BLEU-1 (85.2), and BLEU-4 (39.8). Zero-shot evaluation on Flickr30k confirms that the improvements generalise across datasets, with LDCAP reaching a CIDEr score of 135.4 compared to 132.8 for SCAP. The advantage of LDCAP is most pronounced under degraded illumination, where it outperforms SCAP by 4.4 CIDEr points under low-light conditions versus 2.2 points under normal lighting. Controlled experiments with synthetic gamma degradation at four severity levels confirm that the performance gap widens monotonically as illumination deteriorates, and that internal architectural robustness consistently outperforms CLAHE-based external preprocessing at every degradation level. Ablation experiments validate that all three proposed components contribute independently and positively to overall performance, and attention visualisations demonstrate that LDCAP produces more focused and semantically meaningful attention patterns under challenging visual conditions. The source code is publicly available at doi.org/10.5281/zenodo.19626529.
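The synthetic gamma degradation mentioned above can be sketched as a power-law transform on normalised pixel intensities. This is an illustrative assumption about the general technique, not the authors' exact protocol: the abstract states only that four severity levels were used, so the gamma values below are hypothetical, as are the names `gamma_degrade` and `SEVERITY_GAMMAS`.

```python
def gamma_degrade(pixels, gamma):
    """Darken an image with intensities in [0, 1] by raising each pixel to `gamma`.

    For gamma > 1 every intensity strictly between 0 and 1 decreases,
    which simulates progressively poorer illumination.
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    # Clamp to [0, 1] first so the power law stays well defined.
    return [[min(1.0, max(0.0, p)) ** gamma for p in row] for row in pixels]


def mean_intensity(pixels):
    """Average intensity over all pixels of a 2-D list image."""
    flat = [p for row in pixels for p in row]
    return sum(flat) / len(flat)


# Hypothetical severity levels; the source does not give the actual values.
SEVERITY_GAMMAS = [1.5, 2.0, 3.0, 4.0]

image = [[0.2, 0.5], [0.8, 1.0]]
means = [mean_intensity(gamma_degrade(image, g)) for g in SEVERITY_GAMMAS]
```

With these assumed levels, `means` decreases as the gamma value grows, so each level yields a darker input on which a captioning model and its baseline could be compared.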

Citation Information

@article{veerababureddy2026,
  title={LDCAP: Robust Image Captioning via Latent Compression and Dynamic Decoder Conditioning},
  author={Veerababu Reddy and Seetharam Poola and Musharaf Shaik and Sai Durga Kistaparapu and Vignesh Sajja},
  journal={The Visual Computer},
  year={2026},
  doi={10.21203/rs.3.rs-9480848/v1}
}