Why Skewed Data Causes Problems
This time, let's look at how skewed data and the small file problem were addressed in a Spark Streaming application.
Spark Streaming Application
The Spark Streaming application receives data from Kafka and stores it in HDFS.
The application processes data in 1-minute micro-batches and partitions it by [year, month, day, hour, column_a].
The column_a values lie in the range [100, 600), but in practice they fall only into the following sub-ranges:
- 1xx: 100 - 110
- 2xx: 200 - 220
- 3xx: 300 - 320
- 4xx: 400 - 420
- 5xx: 500 - 520
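For reference, here is a minimal sketch of what such a partitioned write might look like; the function name, DataFrame, and output path are placeholders, not the actual application code:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical sketch of the micro-batch sink: each batch is appended
// and split on disk by [year, month, day, hour, column_a].
def writeMicroBatch(df: DataFrame): Unit = {
  df.write
    .mode(SaveMode.Append)
    .partitionBy("year", "month", "day", "hour", "column_a")
    .parquet("hdfs:///warehouse/events") // placeholder path
}
```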
This application occasionally experienced delays. The root cause was suspected to be high load on the HDFS NameNode, particularly during periods of heavy data ingestion. Whenever delays occurred, metrics showed a surge in file operation requests to the NameNode. But why were so many file operations being generated?
Small File Problem
The data stored by the Spark Streaming application is consumed by another system. This system reads all the data on a daily basis, and during these reads, a massive number of file operation requests are generated. The surprising part is that the total size of the daily data is not large.
Consider a query that counts the number of records with column_a values in the 5xx range for December 1, 2024.
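As an illustration only, such a query might look like the following; the table name events, the session setup, and the exact predicates are assumptions based on the partition layout described above:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("count-5xx").getOrCreate()

// Hypothetical query: the table name `events` is an assumption,
// not the real table used by the downstream system.
val result = spark.sql("""
  SELECT COUNT(*) AS cnt
  FROM events
  WHERE year = 2024 AND month = 12 AND day = 1
    AND column_a >= 500 AND column_a < 600
""")
result.show()
```

Even though partition pruning limits the scan to the 5xx directories, every small file under them still has to be opened, which is what drives up the NameNode request count.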
Although the data size is less than 1GiB, the NameNode experiences significant load due to a large number of small files.
HDFS stores metadata for all files and their associated blocks in memory. As a result, frequent read and write operations on many small files can overwhelm the NameNode.
Skewed Data
Data skew refers to the imbalance in data distribution based on a particular column.
For column_a, the data distribution is as follows:
- 1xx: 95%
- 2xx: 1%
- 3xx: 0.x%
- 4xx: 3%
- 5xx: 0.x%
With 95% of the data concentrated in the 1xx range, the Spark application generates many small files.
The application uses the total number of records and the desired number of records per file to determine the number of partitions and files, but it does not account for the data skew in column_a.
For instance, if there are 8,000,000 records in total, the application creates 32 partitions, each expected to hold 250,000 records. Applying the above distribution, the non-1xx data accounts for only 5% of the total, or 400,000 records. These records are spread thinly across the partitions and written out as a large number of tiny files, causing heavy load on the HDFS NameNode.
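A rough sketch of this kind of skew-unaware sizing, with assumed names and constants:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical sketch: the partition count is derived from the total record
// count alone, so the skew in column_a is ignored.
val recordsPerFile = 250000L

def writeSkewUnaware(df: DataFrame, outputPath: String): Unit = {
  val totalRecords  = df.count()                                        // e.g. 8,000,000
  val numPartitions = math.max(1L, totalRecords / recordsPerFile).toInt // e.g. 32

  df.repartition(numPartitions) // even split, regardless of column_a
    .write
    .partitionBy("year", "month", "day", "hour", "column_a")
    .parquet(outputPath)
}
```

Because partitionBy makes every task write its own file into each column_a directory it touches, the 400,000 non-1xx records end up scattered across the 32 tasks and produce far more files than their size warrants.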
Improved Spark Streaming Application
To address these issues, the Spark Streaming application was improved in two ways:
- Reducing the data stored.
- Reducing the number of files generated.
Reducing the Data Stored
After analyzing how other systems consumed the data, unnecessary records were filtered out during ingestion. This reduced the total data size by about 20%.
Reducing the Number of Files
Initially, increasing the micro-batch interval (e.g., from 1 minute to 2 minutes) was attempted, but this did not yield significant improvement. Instead, a dynamic partitioning approach based on the data distribution was implemented.
Count with GroupBy
The first approach was to calculate the record count for each column_a value and adjust the number of partitions accordingly:
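A sketch of how such a count-then-repartition step might look; the helper names, the salting scheme, and the Parquet sink are assumptions, not the actual implementation:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical sketch: count records per column_a value, derive how many
// files each value needs, then salt the repartitioning key so heavy values
// are split across that many output files.
def writeWithGroupByCounts(df: DataFrame, recordsPerFile: Long, outputPath: String): Unit = {
  // Full aggregation over the micro-batch: this extra pass is the expensive part.
  val filesPerValue = df.groupBy("column_a").count()
    .withColumn("num_files", greatest(lit(1L), ceil(col("count") / recordsPerFile)))
    .select("column_a", "num_files")

  df.join(broadcast(filesPerValue), Seq("column_a"))
    .withColumn("salt", (rand() * col("num_files")).cast("int"))
    .repartition(col("column_a"), col("salt"))
    .drop("salt", "num_files")
    .write
    .partitionBy("year", "month", "day", "hour", "column_a")
    .parquet(outputPath)
}
```

The extra group-by and join add a full shuffle on top of every micro-batch.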
However, this approach introduced high computation costs, causing micro-batch delays of up to 6 minutes.
Row Number
The next approach assigned unique row numbers to each record without performing a full aggregation:
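One plausible shape for this idea uses a window function to number rows within each column_a value; the names below are assumptions and the real implementation may have differed:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Hypothetical sketch: number the rows within each column_a value and derive
// a file index from the row number, so every N rows of a value go to one file.
def writeWithRowNumbers(df: DataFrame, recordsPerFile: Long, outputPath: String): Unit = {
  val window = Window.partitionBy("column_a").orderBy(monotonically_increasing_id())

  df.withColumn("file_idx", (row_number().over(window) / lit(recordsPerFile)).cast("int"))
    .repartition(col("column_a"), col("file_idx"))
    .drop("file_idx")
    .write
    .partitionBy("year", "month", "day", "hour", "column_a")
    .parquet(outputPath)
}
```

In a layout like this, the window partitioned by column_a still pushes 95% of the batch through a single sort, so it is not meaningfully cheaper than the group-by variant.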
This method also incurred high overhead, making it unsuitable for real-time streaming.
Fixed Calculated Volume Ratio
The final solution involved pre-analyzing the data distribution and assigning partitions dynamically:
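A sketch of the fixed-ratio approach, where the per-range ratios are constants derived from offline analysis rather than computed per batch; all names and numbers below are illustrative:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical, pre-calculated volume ratios per column_a range (1 = 1xx, 2 = 2xx, ...).
// The values roughly follow the observed distribution but are illustrative only.
val volumeRatio: Map[Int, Double] = Map(
  1 -> 0.95,
  2 -> 0.01,
  3 -> 0.005,
  4 -> 0.03,
  5 -> 0.005
)

def writeWithFixedRatios(df: DataFrame, recordsPerFile: Long, outputPath: String): Unit = {
  val totalRecords = df.count() // the application already tracks this per batch

  // Files needed per range, derived from the fixed ratio instead of a per-batch aggregation.
  val filesPerRange = volumeRatio.map { case (range, ratio) =>
    range -> math.max(1L, math.ceil(totalRecords * ratio / recordsPerFile).toLong)
  }

  // Map each record's column_a range to its file count via a chained CASE expression.
  val filesExpr = filesPerRange.foldLeft(lit(1L)) { case (acc, (range, files)) =>
    when(floor(col("column_a") / 100) === range, lit(files)).otherwise(acc)
  }

  df.withColumn("salt", (rand() * filesExpr).cast("int"))
    .repartition(col("column_a"), col("salt"))
    .drop("salt")
    .write
    .partitionBy("year", "month", "day", "hour", "column_a")
    .parquet(outputPath)
}
```

Because the ratios are constants, the only per-batch cost is a count, which is far cheaper than a group-by or a window over the whole batch.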
This approach effectively reduced computation overhead by leveraging pre-calculated ratios.
After Improvement
The number of files generated per day was reduced from 19,328,672 to 132,196, less than 1% of the original count. This significantly improved HDFS NameNode performance and system stability.
However, further improvements are needed to handle cases where data ingestion increases or previously minor column_a values suddenly grow in volume.
This experience reinforced the importance of addressing root causes through data analysis rather than relying on assumptions.