d
WE ARE EXPERTS IN TECHNOLOGY

Let’s Work Together

n

StatusNeo

Data Transformation in ETL – What, Why and How

  • data cleaning and structuring
  • management of missing or erroneous entries
  • column formatting and standardization
  • dataset aggregation, filtering, and joining
  • development of derived metrics
  1. Handling NULL Values

# drop rows with nulls
clean_df = df.dropna()

# fill null values with default
filled_df = df.fillna({‘age’: 0, ‘city’: ‘Unknown’})

2. Data Type Conversion

df = df.withColumn(“salary”, col(“salary”).cast(“double”))

3. Renaming/Creating Columns

# rename a column

renamed_df = df.withColumnRenamed(“dob”,”date_of_birth”)

# create a new column

df = df.withColumn(“birth_year”,year(col(“date_of_birth”)))

4. Filtering Records

filtered_df =df.filter(col(“salary”) > 50000)

5. Dropping/Selecting Columns

selected_df = df.select(“name”,”age”)

dropped_df = df.drop(“unwanted_column”)

6. GroupBy and Aggregations

agg_df = df.groupBy(“department”).agg({“salary”:”avg”})

7. Splitting and Concatenating Columns

# split full name

df =df.withColumn(“first_name”, split(col(“full_name”), “ “)[0])

# concatenate columns

df = df.withColumn(“location”, concat_ws(“, “, “city”, “state”)

8. Exploding Arrays

df = df.withColumn(“item”, explode(col(“items”)))

9. Timestamp Transformation

# convert string to timestamp

df = df.withColumn(“timestamp”, to_timestamp(col(“timestamp”), “yyyy-MM-dd HH:mm:ss”))

# format timestamp

df = df.withColumn(“formatted_time”, date_format(col(“timestamp”), “yyyy-MM-dd HH:mm:ss”))

# extract date parts

df = df.withColumn(“year”, year(col(“timestamp”))) \

          .withColumn(“month”, month(col(“timestamp”))) \

         . withColumn(“day”, dayofmonth(col(“timestamp”)))

  • For readability and performance, use withColumn and select over SQL
  • Avoid wide transformations when possible
  • Cache data if reused multiple times
  • Repartition large datasets before heavy transformations