Multi-Modal Phishing Website Detection with Real-Time Rendering Signals and TLS Fingerprints
Abstract
Phishing website detection often relies solely on lexical or HTML features, which makes classifiers fragile against obfuscated URLs and template-based page cloning. This study develops a multi-modal detection framework that combines URL lexical embeddings, DOM structural features, page rendering signals (such as font entropy and visual layout similarity), and TLS certificate fingerprints. We build a dataset of 950,000 URLs, including 160,000 confirmed phishing instances, collected from browser telemetry and public feeds over nine months. Character-level CNNs encode URLs, while a gradient boosting model integrates DOM and TLS features. A small Siamese CNN compares rendered screenshots against a bank of benign templates to capture near-duplicate phishing pages. The framework achieves an AUC of 0.987 and a recall of 95.4%, and it reduces false positives by 19.3% relative to a strong lexical-only baseline. Online experiments in a proxy-based deployment show that median detection latency remains below 20 ms per request. The results indicate that combining transport-layer fingerprints with rendering behavior yields robust, real-time phishing detection suitable for production environments.
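As a rough illustration of two of the input signals named in the abstract, the sketch below encodes a URL as a fixed-length sequence of character ids (the usual input to a character-level CNN) and computes a font entropy value for a page. The abstract does not define either step precisely, so everything here is an assumption: the alphabet, the `MAX_LEN` of 64, the function names `encode_url` and `font_entropy`, and the reading of "font entropy" as the Shannon entropy of the font-family distribution over rendered text nodes.

```python
import math
from collections import Counter
from typing import List

# Hypothetical alphabet and sequence length; the paper does not specify either.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=%"
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}  # 0 = padding, 1 = unknown
MAX_LEN = 64


def encode_url(url: str, max_len: int = MAX_LEN) -> List[int]:
    """Map a URL to a fixed-length list of character ids.

    The URL is lowercased, truncated to max_len, and right-padded with 0.
    Characters outside ALPHABET map to the 'unknown' id 1.
    """
    ids = [CHAR_TO_ID.get(ch, 1) for ch in url.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))


def font_entropy(font_families: List[str]) -> float:
    """Shannon entropy (in bits) of the font-family distribution.

    font_families holds one entry per rendered text node (an assumed
    representation). An empty page is given entropy 0.0.
    """
    if not font_families:
        return 0.0
    counts = Counter(font_families)
    total = len(font_families)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Under these assumptions, a page rendered in a single font has entropy 0, while a page that splits its text nodes evenly between two fonts has entropy of 1 bit. Such scalar signals could be passed to a tabular model alongside DOM and TLS features, while the id sequence from `encode_url` would feed an embedding layer.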
Citation Information
@article{emilykdawson2026,
title={Multi-Modal Phishing Website Detection with Real-Time Rendering Signals and TLS Fingerprints},
author={Emily K. Dawson and Sophie L. Cartwright and James R. Whitfield},
journal={Research Square},
year={2026},
doi={https://doi.org/10.21203/rs.3.rs-9460394/v1}
}