Low-Latency Data Lake Ingestion with Adaptive Partitioning and Query-Aware File Layouts
Keywords:
Data lake ingestion, adaptive partitioning, query-aware file layout, metadata indexing, partition pruning, file compaction, low-latency analytics, read amplification.
Abstract
Low-latency data lake ingestion is essential for enterprise analytics because newly arrived data must become query-ready without creating inefficient partitions, small-file overhead, or excessive metadata scans. This article presents an adaptive ingestion framework that combines source-aware landing-zone control, dynamic partition key selection, query-aware file layout planning, small-file compaction, metadata indexing, latency-aware write scheduling, and schema compatibility validation. The proposed framework reduces the separation between ingestion design and query optimization by using workload patterns during file placement and partition planning. The simulated results show that ingestion latency decreased from 142 ms under static partitioning to 68 ms under the full query-aware layout framework. File compaction efficiency increased from 61.5% to 94.2%, and partition pruning gain improved from 48.7% to 91.6%. Query response time also decreased from 4.8 s to 1.6 s, while metadata scan reduction improved from 35.2% to 88.7% and read amplification control increased from 42.5% to 90.4%. These findings indicate that adaptive partitioning and query-aware file layouts can improve both data freshness and analytical efficiency in large-scale data lake environments.
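To put the reported figures on a common footing, the short sketch below derives relative changes from the before/after values stated in the abstract. The helper names (`relative_reduction`, `point_gain`) and the dictionary keys are illustrative assumptions, not part of the proposed framework; only the numeric inputs come from the text.

```python
# Before/after values as reported in the abstract:
# (static partitioning, full query-aware layout framework).
LOWER_IS_BETTER = {
    "ingestion_latency_ms": (142.0, 68.0),
    "query_response_time_s": (4.8, 1.6),
}
HIGHER_IS_BETTER = {
    "file_compaction_efficiency_pct": (61.5, 94.2),
    "partition_pruning_gain_pct": (48.7, 91.6),
    "metadata_scan_reduction_pct": (35.2, 88.7),
    "read_amplification_control_pct": (42.5, 90.4),
}


def relative_reduction(before: float, after: float) -> float:
    """Percentage decrease relative to the baseline value."""
    return (before - after) / before * 100.0


def point_gain(before: float, after: float) -> float:
    """Absolute increase in percentage points."""
    return after - before


if __name__ == "__main__":
    for name, (before, after) in LOWER_IS_BETTER.items():
        print(f"{name}: {relative_reduction(before, after):.1f}% lower")
    for name, (before, after) in HIGHER_IS_BETTER.items():
        print(f"{name}: +{point_gain(before, after):.1f} percentage points")
```

Under these definitions, ingestion latency falls by roughly 52% and query response time by roughly 67%, while the four percentage-based metrics rise by about 33 to 54 points.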