Why Skewed Data Causes Problems
This time, let's look at how skewed data and the small file problem were addressed in a Spark Streaming application.
Spark Streaming Application
The Spark Streaming application receives data from Kafka and stores it in HDFS.
The application processes data in 1-minute micro-batches and partitions it by [year, month, day, hour, column_a].
The column_a values lie in the range [100, 600), but in practice they fall only into the following sub-ranges:
- 1xx: 100 - 110
- 2xx: 200 - 220
- 3xx: 300 - 320
- 4xx: 400 - 420
- 5xx: 500 - 520
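For reference, here is a minimal sketch of what such a partitioned write might look like; the function name, DataFrame, and output path are placeholders, not the actual application code:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical sketch of the micro-batch sink: each batch is appended
// and split on disk by [year, month, day, hour, column_a].
def writeMicroBatch(df: DataFrame): Unit = {
  df.write
    .mode(SaveMode.Append)
    .partitionBy("year", "month", "day", "hour", "column_a")
    .parquet("hdfs:///warehouse/events") // placeholder path
}
```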
This application occasionally experienced delays. The root cause was suspected to be high load on the HDFS NameNode, particularly during periods of heavy data ingestion. Whenever delays occurred, metrics showed a surge in file operation requests to the NameNode. But why were so many file operations being generated?
Small File Problem
The data stored by the Spark Streaming application is consumed by another system. This system reads all the data on a daily basis, and during these reads, a massive number of file operation requests are generated. The surprising part is that the total size of the daily data is not large.
Consider a query that counts the number of records with column_a values in the 5xx range for December 1, 2024.
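As an illustration only, such a query might look like the following; the table name events, the session setup, and the exact predicates are assumptions based on the partition layout described above:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("count-5xx").getOrCreate()

// Hypothetical query: the table name `events` is an assumption,
// not the real table used by the downstream system.
val result = spark.sql("""
  SELECT COUNT(*) AS cnt
  FROM events
  WHERE year = 2024 AND month = 12 AND day = 1
    AND column_a >= 500 AND column_a < 600
""")
result.show()
```

Even though partition pruning limits the scan to the 5xx directories, every small file under them still has to be opened, which is what drives up the NameNode request count.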
Although the data size is less than 1GiB, the NameNode experiences significant load due to a large number of small files.
HDFS stores metadata for all files and their associated blocks in memory. As a result, frequent read and write operations on many small files can overwhelm the NameNode.
Skewed Data
Data skew refers to the imbalance in data distribution based on a particular column.
For column_a, the data distribution is as follows:
- 1xx: 95%
- 2xx: 1%
- 3xx: 0.x%
- 4xx: 3%
- 5xx: 0.x%
With 95% of the data concentrated in the 1xx range, the Spark application generates many small files.
The application uses the total number of records and the desired number of records per file to determine the number of partitions and files, but it does not account for the data skew in column_a.
For instance, if there are 8,000,000 records in total, the application creates 32 partitions, each expected to hold 250,000 records. Applying the above distribution, the non-1xx data accounts for only 5% of the total, or 400,000 records. These records are spread thinly across the partitions and written out as a large number of tiny files, causing heavy load on the HDFS NameNode.
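A rough sketch of this kind of skew-unaware sizing, with assumed names and constants:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical sketch: the partition count is derived from the total record
// count alone, so the skew in column_a is ignored.
val recordsPerFile = 250000L

def writeSkewUnaware(df: DataFrame, outputPath: String): Unit = {
  val totalRecords  = df.count()                                        // e.g. 8,000,000
  val numPartitions = math.max(1L, totalRecords / recordsPerFile).toInt // e.g. 32

  df.repartition(numPartitions) // even split, regardless of column_a
    .write
    .partitionBy("year", "month", "day", "hour", "column_a")
    .parquet(outputPath)
}
```

Because partitionBy makes every task write its own file into each column_a directory it touches, the 400,000 non-1xx records end up scattered across the 32 tasks and produce far more files than their size warrants.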
Improved Spark Streaming Application
To address these issues, the Spark Streaming application was improved in two ways:
- Reducing the data stored.
- Reducing the number of files generated.
Reducing the Data Stored
After analyzing how other systems consumed the data, unnecessary records were filtered out during ingestion. This reduced the total data size by about 20%.
Reducing the Number of Files
Initially, increasing the micro-batch interval (e.g., from 1 minute to 2 minutes) was attempted, but this did not yield significant improvement. Instead, a dynamic partitioning approach based on the data distribution was implemented.
Count with GroupBy
The first approach was to calculate the record count for each column_a value and adjust the number of partitions accordingly:
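A sketch of how such a count-then-repartition step might look; the helper names, the salting scheme, and the Parquet sink are assumptions, not the actual implementation:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical sketch: count records per column_a value, derive how many
// files each value needs, then salt the repartitioning key so heavy values
// are split across that many output files.
def writeWithGroupByCounts(df: DataFrame, recordsPerFile: Long, outputPath: String): Unit = {
  // Full aggregation over the micro-batch: this extra pass is the expensive part.
  val filesPerValue = df.groupBy("column_a").count()
    .withColumn("num_files", greatest(lit(1L), ceil(col("count") / recordsPerFile)))
    .select("column_a", "num_files")

  df.join(broadcast(filesPerValue), Seq("column_a"))
    .withColumn("salt", (rand() * col("num_files")).cast("int"))
    .repartition(col("column_a"), col("salt"))
    .drop("salt", "num_files")
    .write
    .partitionBy("year", "month", "day", "hour", "column_a")
    .parquet(outputPath)
}
```

The extra group-by and join add a full shuffle on top of every micro-batch.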
However, this approach introduced high computation costs, causing micro-batch delays of up to 6 minutes.
Row Number
The next approach assigned unique row numbers to each record without performing a full aggregation:
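One plausible shape for this idea uses a window function to number rows within each column_a value; the names below are assumptions and the real implementation may have differed:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Hypothetical sketch: number the rows within each column_a value and derive
// a file index from the row number, so every N rows of a value go to one file.
def writeWithRowNumbers(df: DataFrame, recordsPerFile: Long, outputPath: String): Unit = {
  val window = Window.partitionBy("column_a").orderBy(monotonically_increasing_id())

  df.withColumn("file_idx", (row_number().over(window) / lit(recordsPerFile)).cast("int"))
    .repartition(col("column_a"), col("file_idx"))
    .drop("file_idx")
    .write
    .partitionBy("year", "month", "day", "hour", "column_a")
    .parquet(outputPath)
}
```

In a layout like this, the window partitioned by column_a still pushes 95% of the batch through a single sort, so it is not meaningfully cheaper than the group-by variant.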
This method also incurred high overhead, making it unsuitable for real-time streaming.
Fixed Calculated Volume Ratio
The final solution involved pre-analyzing the data distribution and assigning partitions dynamically:
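A sketch of the fixed-ratio approach, where the per-range ratios are constants derived from offline analysis rather than computed per batch; all names and numbers below are illustrative:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical, pre-calculated volume ratios per column_a range (1 = 1xx, 2 = 2xx, ...).
// The values roughly follow the observed distribution but are illustrative only.
val volumeRatio: Map[Int, Double] = Map(
  1 -> 0.95,
  2 -> 0.01,
  3 -> 0.005,
  4 -> 0.03,
  5 -> 0.005
)

def writeWithFixedRatios(df: DataFrame, recordsPerFile: Long, outputPath: String): Unit = {
  val totalRecords = df.count() // the application already tracks this per batch

  // Files needed per range, derived from the fixed ratio instead of a per-batch aggregation.
  val filesPerRange = volumeRatio.map { case (range, ratio) =>
    range -> math.max(1L, math.ceil(totalRecords * ratio / recordsPerFile).toLong)
  }

  // Map each record's column_a range to its file count via a chained CASE expression.
  val filesExpr = filesPerRange.foldLeft(lit(1L)) { case (acc, (range, files)) =>
    when(floor(col("column_a") / 100) === range, lit(files)).otherwise(acc)
  }

  df.withColumn("salt", (rand() * filesExpr).cast("int"))
    .repartition(col("column_a"), col("salt"))
    .drop("salt")
    .write
    .partitionBy("year", "month", "day", "hour", "column_a")
    .parquet(outputPath)
}
```

Because the ratios are constants, the only per-batch cost is a count, which is far cheaper than a group-by or a window over the whole batch.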
This approach effectively reduced computation overhead by leveraging pre-calculated ratios.
After Improvement
The number of files generated per day was reduced from 19,328,672 to 132,196, less than 1% of the original count. This significantly improved HDFS NameNode performance and system stability.
However, further improvements are needed to handle cases where data ingestion increases or previously minor column_a values suddenly grow in volume.
This experience reinforced the importance of addressing root causes through data analysis rather than relying on assumptions.