SuperCluster: Learning Prediction Prototypes via Target-Guided Cross-Attention Clustering
Abstract
Clustering is widely used to discover interpretable structure in data, but conventional unsupervised methods group examples by \emph{feature similarity} alone, ignoring any available prediction target. When the goal is to explain \emph{why} certain examples are predicted differently, feature-similar clusters can be misleading: they may lump together examples with very different outcomes. We propose \textbf{SuperCluster}, a method that learns $K$ prediction prototypes jointly with a supervised objective. Examples are routed to prototypes via a cross-attention mechanism, and the final prediction is a weighted average of the prototype logits. This forces the model to discover $K$ predictive archetypes: groups that are both feature-coherent and governed by a common prediction regime. We evaluate SuperCluster on four tabular binary classification benchmarks (NBA shot prediction, bank telemarketing subscription, adult income, and credit card default) and a 7-class benchmark (UCI Covertype, $n=581$k). Target-guided clustering produces clusters qualitatively different from feature-only k-means clusters, matches or exceeds a deep MLP in accuracy on all five datasets, and consistently outperforms an unsupervised clustering baseline by 1.6--4.3 percentage points in AUC on the binary tasks. A key design choice separates prototype geometry (used for routing) from prototype prediction scores (used for output), so that each prototype's predicted probability $\sigma(s_k)$, where $s_k$ is the score of prototype $k$, is directly readable without going through a downstream linear head. Beyond prediction and interpretation, SuperCluster provides an operational tool for estimating the effective number of latent prediction regimes, $K_{\text{regime}}$: over-specify the prototype budget $K$ and examine which prototypes are stably occupied across seeds and values of $K$. On Covertype, the model activates 6 prototypes that map closely onto distinct forest cover types, suggesting that $K_{\text{regime}}$ may grow with prediction dimensionality $C$.
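The routing-and-averaging step described in the abstract can be illustrated with a minimal NumPy sketch of the binary case. It is not the authors' implementation: the query projection `W_q`, the scaled dot-product form of the attention scores, the softmax normalization, and the hard-assignment `occupancy` helper are all assumptions introduced here for illustration. Only the overall structure comes from the abstract: prototype keys `P` are used for routing, separate per-prototype logits `s` are used for output, and the prediction is a weighted average of those logits.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def supercluster_forward(X, W_q, P, s):
    """Hypothetical forward pass for the binary case.

    X   : (n, d) input features
    W_q : (d, h) query projection (assumed; not specified in the abstract)
    P   : (K, h) prototype keys, used only for routing (geometry)
    s   : (K,)   prototype logits, used only for output (prediction scores)
    """
    Q = X @ W_q                                 # (n, h) queries
    scores = Q @ P.T / np.sqrt(P.shape[1])      # (n, K) assumed scaled dot-product
    A = softmax(scores, axis=1)                 # (n, K) routing weights, rows sum to 1
    logits = A @ s                              # weighted average of prototype logits
    return sigmoid(logits), A

def occupancy(A):
    # Fraction of examples whose largest routing weight falls on each prototype.
    K = A.shape[1]
    return np.bincount(A.argmax(axis=1), minlength=K) / A.shape[0]

# Toy data with random, untrained parameters (illustration only).
rng = np.random.default_rng(0)
n, d, h, K = 200, 5, 8, 4
X = rng.normal(size=(n, d))
W_q = rng.normal(size=(d, h))
P = rng.normal(size=(K, h))
s = rng.normal(size=K)

probs, A = supercluster_forward(X, W_q, P, s)
occ = occupancy(A)
prototype_probs = sigmoid(s)   # each prototype's readable probability sigma(s_k)
```

Because the routing weights in each row are non-negative and sum to one, every example's logit lies between the smallest and largest $s_k$, so $\sigma(s_k)$ can be read off per prototype without a downstream head. Under the same assumptions, comparing `occupancy` across seeds and values of $K$ is one way to approximate the "stably occupied" check; the abstract does not state the exact occupancy criterion.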
Citation Information
@article{aarondanielson2026,
title={SuperCluster: Learning Prediction Prototypes via Target-Guided Cross-Attention Clustering},
author={Aaron Danielson},
journal={Research Square},
year={2026},
doi={10.21203/rs.3.rs-9163443/v1}
}