Data Quality Scoring Models for Streaming Pipelines with Rule-Based and Learned Validation
Keywords:
Streaming data quality, rule-based validation, learned validation, anomaly detection, quality scoring, sliding-window aggregation, data governance.
Abstract
Streaming pipelines need continuous data quality scoring because errors can appear while events are still moving through ingestion, validation, transformation, and delivery stages. This article presents a hybrid quality scoring model that combines rule-based validation with learned anomaly detection to measure completeness, timeliness, validity, consistency, uniqueness, schema stability, distribution stability, and source reliability. The proposed model profiles each event, applies deterministic validation rules, detects abnormal stream behavior using learned signals, and aggregates scores across sliding processing windows with temporal weighting. Simulated results show that the total quality score declined during unstable windows, mainly due to timeliness and uniqueness degradation, and later recovered as the stream stabilized. The hybrid model also showed stronger diagnostic value than rule-only validation because it detected both explicit failures, such as schema drift and duplicate spikes, and hidden behavioral changes, such as distribution shift. These findings indicate that streaming data quality should be represented as an interpretable dynamic score rather than a simple pass-or-fail decision.
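As an illustration of the aggregation step summarized above, the following minimal sketch combines per-dimension scores into a window score and then blends successive window scores with temporal weighting. It is not the article's implementation: the function names, the equal default weights, the exponential decay form, and the decay value of 0.5 are all assumptions chosen for illustration, and every dimension score is assumed to lie in [0, 1].

```python
from typing import Dict, List, Optional


def window_score(dim_scores: Dict[str, float],
                 weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted mean of per-dimension quality scores for one window.

    dim_scores maps a dimension name (e.g. "timeliness") to a score in [0, 1].
    weights is hypothetical; if omitted, all dimensions count equally.
    """
    if not dim_scores:
        raise ValueError("at least one dimension score is required")
    if weights is None:
        weights = {name: 1.0 for name in dim_scores}
    total_weight = sum(weights[name] for name in dim_scores)
    weighted_sum = sum(weights[name] * score
                       for name, score in dim_scores.items())
    return weighted_sum / total_weight


def temporal_score(window_scores: List[float], decay: float = 0.5) -> float:
    """Exponentially weighted average over windows, oldest first.

    The newest window gets weight 1, the one before it gets `decay`,
    the one before that gets `decay ** 2`, and so on (assumed scheme).
    """
    if not window_scores:
        raise ValueError("at least one window score is required")
    n = len(window_scores)
    time_weights = [decay ** (n - 1 - i) for i in range(n)]
    blended = sum(w * s for w, s in zip(time_weights, window_scores))
    return blended / sum(time_weights)


if __name__ == "__main__":
    # Made-up scores for three consecutive windows (not from the article).
    windows = [
        {"completeness": 0.98, "timeliness": 0.95, "uniqueness": 0.97},
        {"completeness": 0.97, "timeliness": 0.70, "uniqueness": 0.75},
        {"completeness": 0.98, "timeliness": 0.90, "uniqueness": 0.94},
    ]
    per_window = [window_score(w) for w in windows]
    print(per_window, temporal_score(per_window))
```

Keeping the per-dimension scores alongside the blended value preserves interpretability: a drop in the total score can be traced back to the dimensions that drove it, rather than collapsing to a pass-or-fail flag.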