
PySpark | How To Handle Nulls In DataFrame?



Handling NULL (or None) values is a crucial task in data processing, 
as missing data can skew analysis, produce errors in data transformations, 
and degrade the performance of machine learning models. 
In PySpark, dealing with NULL values is a common operation when working with distributed 
datasets. PySpark provides several methods and techniques to detect, manage, and clean up 
missing or NULL values in a DataFrame.

In this blog post, we’ll explore how to handle NULL values in PySpark DataFrames, covering 
essential methods like filtering, filling, dropping, and replacing NULL values.

Methods to Handle NULL Values in PySpark:

PySpark provides several ways to manage NULL values effectively:
  1. Detecting NULLs: Identifying rows or columns with NULL values.
  2. Filtering: Excluding NULL values from the DataFrame.
  3. Dropping: Removing rows or columns with NULL values.
  4. Filling: Replacing NULL values with a specific value.
  5. Replacing: Substituting NULL values based on certain conditions.

1. Detecting NULL Values: Before handling NULL values, it’s essential to detect which rows or 
columns contain missing data. In PySpark, you can use the isNull() and isNotNull() methods to 
check for NULL values in specific columns.

Example: Detecting NULLs in a Column: Let’s create a DataFrame with some NULL values and 
use isNull() to detect them.

from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("Handle NULLs Example").getOrCreate()
# Sample DataFrame with NULL values
data = [("Alice", 25), ("Bob", None), ("Catherine", 29), (None, 22)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Detect rows where 'Age' is NULL
df_nulls_age = df.filter(df["Age"].isNull())
# Show the result
df_nulls_age.show()
Output:

+----+----+
|Name| Age|
+----+----+
| Bob|null|
+----+----+
In this example, the filter keeps only the row where Age is NULL (Bob's row).
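To see at a glance which columns contain missing data, you can also count NULLs per column. The following is a minimal sketch using count() and when() from pyspark.sql.functions, assuming the same df as above; with this sample data it reports one NULL in Name and one in Age.

from pyspark.sql.functions import col, count, when
# Count NULL values in every column of the DataFrame
null_counts = df.select(
    [count(when(col(c).isNull(), c)).alias(c) for c in df.columns]
)
null_counts.show()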

2. Filtering NULL Values: If you want to filter out NULL values from a specific column, 
you can use the filter() or where() methods along with the isNotNull() method to retain 
only rows that contain non-NULL values.

Example: Filtering Out NULL Values.

# Filter out rows where 'Age' is NULL
df_filtered = df.filter(df["Age"].isNotNull())
# Show the result
df_filtered.show()
Output:

+---------+---+
|     Name|Age|
+---------+---+
|    Alice| 25|
|Catherine| 29|
|     null| 22|
+---------+---+
Here, all rows with a NULL value in the Age column are removed, and only rows with a non-NULL Age remain. Note that the row with a NULL Name is kept, because only the Age column was checked.
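The where() method is an alias of filter() and also accepts a SQL-style condition string. A small sketch of the equivalent filter, assuming the same df:

# Equivalent filter using where() with a SQL expression
df_filtered_sql = df.where("Age IS NOT NULL")
# Show the result
df_filtered_sql.show()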

3. Dropping NULL Values: PySpark provides the dropna() method to drop rows (or columns) 
with NULL values. You can customize the behavior of dropna() based on how many NULL values 
you want to tolerate.

Syntax: DataFrame.dropna(how='any', thresh=None, subset=None)
Parameters:
how: Can be 'any' (default) or 'all'. If set to 'any', rows containing at least one NULL value are dropped. 
If set to 'all', only rows in which every value is NULL are dropped.
thresh: Specifies the minimum number of non-NULL values required to retain a row; if specified, it overrides how.
subset: Specifies the columns to check for NULL values.
Example 1: Dropping Rows with Any NULL Values.

# Drop rows where any column has NULL values
df_dropped_any = df.dropna()
# Show the result
df_dropped_any.show()
Output:

+---------+---+
|     Name|Age|
+---------+---+
|    Alice| 25|
|Catherine| 29|
+---------+---+
In this example, both rows that contain a NULL value (Bob's row, where Age is NULL, and the row where Name is NULL) are removed.

Example 2: Dropping Rows with NULLs in Specific Columns: If you only want to drop rows 
with NULL values in specific columns, you can use the subset parameter.

# Drop rows where 'Age' is NULL
df_dropped_subset = df.dropna(subset=["Age"])
# Show the result
df_dropped_subset.show()
Output:

+---------+---+
|     Name|Age|
+---------+---+
|    Alice| 25|
|Catherine| 29|
|     null| 22|
+---------+---+
In this case, rows are dropped only if the Age column contains a NULL value.
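For completeness, here is a minimal sketch of the how='all' and thresh parameters described above, assuming the same df. With this sample data, how='all' drops nothing (no row is entirely NULL), while thresh=2 drops the two rows that have only one non-NULL value.

# Drop rows only when every column is NULL
df_dropped_all = df.dropna(how="all")
df_dropped_all.show()
# Keep only rows with at least 2 non-NULL values
df_dropped_thresh = df.dropna(thresh=2)
df_dropped_thresh.show()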

4. Filling NULL Values: Another common way to handle NULL values is by replacing them with 
a specific value using the fillna() method. This is useful when you want to fill missing 
values with a default or placeholder value, such as 0, an empty string, or the mean of the 
column.

Syntax: DataFrame.fillna(value, subset=None)
Parameters:
value: The value used to replace NULL values. Can be a scalar or a dictionary mapping columns 
to values.
subset: Optional. A list of column names where you want to fill NULL values.
Example 1: Filling NULL Values with a Scalar: You can fill all NULL values in a column with 
a single scalar value.

# Fill NULL values in 'Age' with 0
df_filled = df.fillna(0, subset=["Age"])
# Show the result
df_filled.show()
Output:

+---------+---+
|     Name|Age|
+---------+---+
|    Alice| 25|
|      Bob|  0|
|Catherine| 29|
|     null| 22|
+---------+---+
In this example, the NULL values in the Age column are replaced with 0.

Example 2: Filling NULL Values with Different Values per Column: You can also provide a 
dictionary to specify different fill values for different columns.

# Fill NULL values in 'Age' with 0 and 'Name' with 'Unknown'
df_filled_multiple = df.fillna({"Age": 0, "Name": "Unknown"})
# Show the result
df_filled_multiple.show()
Output:

+---------+---+
|     Name|Age|
+---------+---+
|    Alice| 25|
|      Bob|  0|
|Catherine| 29|
|  Unknown| 22|
+---------+---+
Here, NULL values in the Age column are filled with 0, and NULL values in the Name column 
are filled with “Unknown”.
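To fill with the mean of the column, as mentioned above, you can compute the average first and pass it to fillna(). A minimal sketch, assuming the same df; the mean is rounded to an integer here because Age was inferred as an integer column.

from pyspark.sql.functions import avg
# Compute the mean of 'Age' (avg ignores NULLs) and use it as the fill value
mean_age = df.select(avg("Age")).first()[0]
df_filled_mean = df.fillna({"Age": int(round(mean_age))})
# Show the result
df_filled_mean.show()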

5. Replacing NULL Values: In some cases, you may want to replace NULL values conditionally, 
for example based on non-NULL values in other columns or on a predefined condition. 
Note that na.replace() only substitutes existing non-NULL values; to replace NULLs conditionally, 
use when() and otherwise() from pyspark.sql.functions (a coalesce() sketch also appears at the end of this section).

Example: Replacing NULL Values.

from pyspark.sql.functions import when, col
# Replace NULL values in the 'Name' column with 'Anonymous'
df_replaced = df.withColumn(
    "Name", when(col("Name").isNull(), "Anonymous").otherwise(col("Name"))
)
# Show the result
df_replaced.show()
Output:

+---------+----+
|     Name| Age|
+---------+----+
|    Alice|  25|
|      Bob|null|
|Catherine|  29|
|Anonymous|  22|
+---------+----+
In this example, the NULL values in the Name column are replaced with the string “Anonymous”.
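If the replacement should come from another column rather than a constant, coalesce() returns the first non-NULL value across the columns you pass it. The sketch below uses a hypothetical DataFrame with a Nickname fallback column, purely for illustration.

from pyspark.sql.functions import coalesce, col, lit
# Hypothetical DataFrame with a 'Nickname' fallback column (illustration only)
people = spark.createDataFrame(
    [("Alice", "Ally"), (None, "Cat"), (None, None)],
    ["Name", "Nickname"],
)
# Use 'Name' when present, otherwise 'Nickname', otherwise a literal default
people = people.withColumn(
    "DisplayName", coalesce(col("Name"), col("Nickname"), lit("Anonymous"))
)
# Show the result
people.show()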
Conclusion: Handling NULL values in PySpark is essential for ensuring data quality and 
integrity in your data pipeline. PySpark provides powerful methods to detect, filter, drop, 
fill, and replace NULL values in DataFrames, enabling you to clean your data efficiently. 
Whether you’re preparing data for analysis, reporting, or machine learning models, 
handling NULL values correctly will improve the accuracy and performance of your operations.


