Article 2026-04-21 under-review v1

A Structure-First Paradigm for Morphological Parsing: Synthesizing Discrete Representation and Diffusion Model

Zhan Chen Beijing Normal University
Fangzhou Liu Tsinghua University
Martijn Naaijer University of Zurich
Willem Th. van Peursen Vrije Universiteit Amsterdam

Abstract

Under the ETCBC encoding system, morphological parsing is a rigorous reconstruction of internal word structure rather than a simple tagging task. While contemporary NLP paradigms emphasize performance gains through data accumulation, we demonstrate that in Classical Syriac, a strictly bounded corpus, scaling training data produces counterintuitive results: expanding the training set with distributionally divergent, out-of-distribution (OOD) data fails to yield positive transfer. To overcome this, we propose a Structure-First solution, which prioritizes representational and architectural constraints over raw data scale. This paradigm integrates two synergistic interventions: 1) a Discretization Strategy that maps variable-length morphological strings into atomic primitives, establishing the fixed-length structural alignment required by the models in the second step, and 2) an Encoder-Only Classifier, which leads to a Masked Diffusion Model that employs iterative denoising to capture global morphological dependencies. Unlike autoregressive models, which are limited by linear error propagation, the diffusion mechanism enables the model to dynamically resolve ambiguities through an evolving global context, a process that mirrors the non-linear cognitive workflow of expert philologists. Our approach achieves a state-of-the-art Character Error Rate (CER) of 3.42%, overcoming the performance plateau. These findings suggest that for parsing low-resource historical languages, optimized structural representation and architectural improvements are more effective than indiscriminate data scaling.
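The two interventions described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the slot inventory (`SLOTS`), the labels (`W`, `KTB`, `T`), and the hand-written `toy_scorer` are hypothetical stand-ins, and the confidence-ordered unmasking schedule is one common way to decode masked models, assumed here because the abstract does not specify the denoising schedule. The sketch only shows how a fixed-length representation lets every position be predicted in parallel and revised as context accumulates.

```python
from typing import Callable, Dict, List, Tuple

MASK = "<mask>"
NONE = "<none>"

# Hypothetical fixed slot inventory; the real ETCBC primitives are not given here.
SLOTS = ["prefix", "stem", "suffix"]


def discretize(segments: Dict[str, str]) -> List[str]:
    """Map a variable-length analysis onto a fixed-length slot sequence."""
    return [segments.get(slot, NONE) for slot in SLOTS]


# A scorer returns (predicted label, confidence) for one masked position,
# given the whole partially filled sequence as context.
Scorer = Callable[[List[str], int], Tuple[str, float]]


def iterative_unmask(length: int, scorer: Scorer, per_step: int = 1) -> List[str]:
    """Start fully masked and commit the most confident predictions each step."""
    seq = [MASK] * length
    while MASK in seq:
        # Re-predict every still-masked position from the current context.
        preds = [(i, *scorer(seq, i)) for i, tok in enumerate(seq) if tok == MASK]
        preds.sort(key=lambda p: p[2], reverse=True)
        for i, label, _ in preds[: max(1, per_step)]:
            seq[i] = label
    return seq


def toy_scorer(seq: List[str], i: int) -> Tuple[str, float]:
    """Hand-written stand-in for a trained denoiser (assumption, not a model)."""
    if i == 1:                      # the stem is easy to identify
        return ("KTB", 0.9)
    if i == 0:                      # the prefix is clearer once the stem is known
        return ("W", 0.95 if seq[1] != MASK else 0.8)
    if seq[1] == MASK:              # the suffix is ambiguous without the stem
        return (NONE, 0.3)
    return ("T", 0.7)


fixed = discretize({"stem": "KTB", "prefix": "W"})
decoded = iterative_unmask(len(SLOTS), toy_scorer)
print(fixed)    # ['W', 'KTB', '<none>']
print(decoded)  # ['W', 'KTB', 'T']
```

In this toy run the suffix slot would have been filled with `<none>` had it been committed first; because the stem is committed first, the later prediction changes to `T`. A left-to-right decoder would instead have to fix the prefix before seeing the stem.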

Citation Information

@article{zhanchen2026,
  title={A Structure-First Paradigm for Morphological Parsing: Synthesizing Discrete Representation and Diffusion Model},
  author={Zhan Chen and Fangzhou Liu and Martijn Naaijer and Willem Th. van Peursen},
  journal={Humanities and Social Sciences Communications},
  year={2026},
  doi={https://doi.org/10.21203/rs.3.rs-8509249/v1}
}