Data Transformation in ETL – What, Why and How
In the world of data engineering, raw data is often unstructured, incomplete, or inconsistent. To extract useful insights, it must be cleaned and transformed before being loaded into downstream systems. In this blog, we will explore some of the most important data transformation methods using PySpark.
Why Data Transformation matters
Raw data from sources like files, logs, relational databases, and APIs is usually not ready for analysis. Data transformation helps with:
- data cleaning and structuring
- management of missing or erroneous entries
- column formatting and standardization
- dataset aggregation, filtering, and joining
- development of derived metrics
Some Important Data Transformations in PySpark
Below are some of the important transformations that form an essential part of any ETL job.
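The snippets below use column functions from pyspark.sql.functions, so a common setup is shown first. This is a minimal sketch, assuming a local SparkSession and an illustrative employees.csv source file; the file path and column names are placeholders only:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, year, month, dayofmonth, split,
    concat_ws, explode, to_timestamp, date_format
)
# start (or reuse) a Spark session for the examples below
spark = SparkSession.builder.appName("etl-transformations").getOrCreate()
# load raw source data; the path and schema here are illustrative
df = spark.read.option("header", True).csv("employees.csv")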
1. Handling NULL Values
# drop rows with nulls
clean_df = df.dropna()
# fill null values with default
filled_df = df.fillna({"age": 0, "city": "Unknown"})
2. Data Type Conversion
df = df.withColumn("salary", col("salary").cast("double"))
3. Renaming/Creating Columns
# rename a column
renamed_df = df.withColumnRenamed("dob", "date_of_birth")
# create a new column
df = df.withColumn("birth_year", year(col("date_of_birth")))
4. Filtering Records
filtered_df = df.filter(col("salary") > 50000)
5. Dropping/Selecting Columns
selected_df = df.select("name", "age")
dropped_df = df.drop("unwanted_column")
6. GroupBy and Aggregations
agg_df = df.groupBy("department").agg({"salary": "avg"})
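The dictionary form above is concise, but when you need several aggregations or readable output column names, the function-based form is a common alternative. A small sketch; the column aliases are illustrative:
from pyspark.sql.functions import avg, count
agg_df = df.groupBy("department").agg(
    avg("salary").alias("avg_salary"),
    count("*").alias("employee_count"),
)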
7. Splitting and Concatenating Columns
# split full name
df = df.withColumn("first_name", split(col("full_name"), " ")[0])
# concatenate columns
df = df.withColumn("location", concat_ws(", ", "city", "state"))
8. Exploding Arrays
df = df.withColumn("item", explode(col("items")))
9. Timestamp Transformation
# convert string to timestamp
df = df.withColumn("timestamp", to_timestamp(col("timestamp"), "yyyy-MM-dd HH:mm:ss"))
# format timestamp
df = df.withColumn("formatted_time", date_format(col("timestamp"), "yyyy-MM-dd HH:mm:ss"))
# extract date parts
df = df.withColumn("year", year(col("timestamp"))) \
    .withColumn("month", month(col("timestamp"))) \
    .withColumn("day", dayofmonth(col("timestamp")))
Best Practices for PySpark Data Transformation
- Prefer DataFrame API functions such as withColumn and select over string-based SQL expressions for readability and maintainability
- Minimize wide transformations (joins, groupBy, distinct) that trigger shuffles where possible
- Cache DataFrames that are reused across multiple actions (see the sketch after this list)
- Repartition large datasets before heavy, shuffle-intensive transformations
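A minimal sketch of the caching and repartitioning points, assuming a hypothetical orders_df DataFrame is already loaded; the partition count of 200 is illustrative and should be tuned to your data volume and cluster:
# repartition before an expensive, shuffle-heavy aggregation (200 is illustrative)
orders_df = orders_df.repartition(200, "customer_id")
# cache a DataFrame that is reused by multiple downstream actions
orders_df.cache()
daily_totals = orders_df.groupBy("order_date").sum("amount")
orders_by_customer = orders_df.groupBy("customer_id").count()
# release the cached data once it is no longer needed
orders_df.unpersist()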
Conclusion
Data transformation is a crucial step in any ETL process, ensuring that raw data is cleansed, structured, and optimized for analysis. Whether it's filtering, aggregating, normalizing, or handling timestamps, each transformation plays a key role in improving data quality and usability. A well-designed transformation step improves reporting accuracy and ensures that downstream systems receive data in an efficient and reliable format, so it is worth investing time in carefully designing the transformation logic to unlock the true value of the data.