Chapter 5: Working with Column Expressions

Column expressions and functions let you operate on the data within columns: mathematical computations, string manipulation, date arithmetic, and handling missing values. This chapter walks through the most common of these operations.

Using Column Functions

Column functions can be used to transform data in various ways. Let's look at examples of string manipulation, mathematical operations, and date functions.
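
The examples in this chapter assume a DataFrame df with a string column name, a numeric column salary, and a date column startDate. As a minimal, hypothetical setup in PySpark (the rows and app name are purely illustrative):

from pyspark.sql import SparkSession
import datetime

spark = SparkSession.builder.appName("chapter5").getOrCreate()  # hypothetical app name

# Illustrative rows matching the columns used throughout this chapter
df = spark.createDataFrame(
    [("Alice", 50000.0, datetime.date(2023, 1, 1)),
     ("Bob", 60000.0, datetime.date(2023, 6, 15))],
    ["name", "salary", "startDate"],
)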

Scala Example

import org.apache.spark.sql.functions._

// String manipulation
df.withColumn("upperName", upper(col("name"))).show()

// Mathematical operations
df.withColumn("salaryIncrease", col("salary") * 1.1).show()

// Date operations
df.withColumn("nextDay", date_add(col("startDate"), 1)).show()

Python (PySpark) Example

from pyspark.sql.functions import upper, col, date_add

# String manipulation
df.withColumn("upperName", upper(col("name"))).show()

# Mathematical operations
df.withColumn("salaryIncrease", col("salary") * 1.1).show()

# Date operations
df.withColumn("nextDay", date_add(col("startDate"), 1)).show()

In these examples, withColumn returns a new DataFrame with an additional column computed from the given expression (if a column with that name already exists, it is replaced). We uppercase a string column, apply a 10% salary increase, and compute the day after each start date.
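
Because each call returns a new DataFrame, these transformations can be chained into a single pipeline; a short sketch combining the three operations above:

from pyspark.sql.functions import upper, col, date_add

# Each withColumn returns a new DataFrame, so calls chain naturally
result = (
    df.withColumn("upperName", upper(col("name")))
      .withColumn("salaryIncrease", col("salary") * 1.1)
      .withColumn("nextDay", date_add(col("startDate"), 1))
)
result.show()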

Handling Missing Data

Dealing with missing data is a common task in data processing. Spark provides several options for handling nulls, including dropping rows with null values, filling them with a default value, or selecting only rows that meet certain conditions.

Dropping Rows with Null Values

Scala Example

// Drop any row that contains at least one null value
df.na.drop().show()

Python (PySpark) Example

# Drop any row that contains at least one null value
df.na.drop().show()
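
na.drop also accepts optional arguments: how ("any" drops a row with at least one null, "all" only when every inspected column is null) and subset, the list of columns to inspect. A short PySpark sketch using the hypothetical name and salary columns from earlier:

# Drop rows only when both name and salary are null
df.na.drop(how="all", subset=["name", "salary"]).show()

# Drop rows where name is null
df.na.drop(how="any", subset=["name"]).show()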

Filling Null Values

Scala Example

df.na.fill("unknown", Seq("name")).show()

Python (PySpark) Example

df.na.fill("unknown", ["name"]).show()

Filtering Rows Based on Null Conditions

Scala Example

df.filter(col("name").isNotNull).show()

Python (PySpark) Example

# Keep only rows where name is not null
df.filter(df.name.isNotNull()).show()
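
Null checks compose with other column expressions via & and |; a small sketch, again assuming the hypothetical name and salary columns:

from pyspark.sql.functions import col

# Keep rows where both name and salary are present
df.filter(col("name").isNotNull() & col("salary").isNotNull()).show()

# Or inspect the complement: rows with a missing name
df.filter(col("name").isNull()).show()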

Handling missing data appropriately is crucial for maintaining the integrity of your analysis. Spark's DataFrame API provides robust tools for dealing with nulls, allowing you to clean and prepare your data for analysis effectively.
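
Nulls can also be handled inline in expressions: coalesce returns the first non-null value among its arguments. A minimal sketch, again assuming the name column:

from pyspark.sql.functions import coalesce, col, lit

# Substitute "unknown" for null names without altering the source column
df.withColumn("displayName", coalesce(col("name"), lit("unknown"))).show()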


This chapter delves into the manipulation and transformation of column data in Spark DataFrames, covering essential techniques for working with column expressions and handling missing data. These operations are vital for data cleaning, preparation, and feature engineering in data analysis and machine learning projects.