
PySpark | How To Handle Nulls In DataFrame?



Handling NULL (or None) values is a crucial task in data processing, 
as missing data can skew analysis, produce errors in data transformations, 
and degrade the performance of machine learning models. 
In PySpark, dealing with NULL values is a common operation when working with distributed 
datasets. PySpark provides several methods and techniques to detect, manage, and clean up 
missing or NULL values in a DataFrame.

In this blog post, we’ll explore how to handle NULL values in PySpark DataFrames, covering 
essential methods like filtering, filling, dropping, and replacing NULL values.

Methods to Handle NULL Values in PySpark:

PySpark provides several ways to manage NULL values effectively:
  1. Detecting NULLs: Identifying rows or columns with NULL values.
  2. Filtering: Excluding NULL values from the DataFrame.
  3. Dropping: Removing rows or columns with NULL values.
  4. Filling: Replacing NULL values with a specific value.
  5. Replacing: Substituting NULL values based on certain conditions.

1. Detecting NULL Values: Before handling NULL values, it’s essential to detect which rows or 
columns contain missing data. In PySpark, you can use the isNull() and isNotNull() methods to 
check for NULL values in specific columns.

Example: Detecting NULLs in a Column: Let’s create a DataFrame with some NULL values and 
use isNull() to detect them.

from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("Handle NULLs Example").getOrCreate()
# Sample DataFrame with NULL values
data = [("Alice", 25), ("Bob", None), ("Catherine", 29), (None, 22)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Detect rows where 'Age' is NULL
df_nulls_age = df.filter(df["Age"].isNull())
# Show the result
df_nulls_age.show()
Output:

+----+----+
|Name| Age|
+----+----+
| Bob|null|
+----+----+
In this example, the filter keeps only the row where Age is NULL (Bob's row).
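To see at a glance which columns contain missing data, you can also count NULLs per column. The following is a minimal sketch using count() and when() from pyspark.sql.functions, assuming the same df as above; with this sample data it reports one NULL in Name and one in Age.

from pyspark.sql.functions import col, count, when
# Count NULL values in every column of the DataFrame
null_counts = df.select(
    [count(when(col(c).isNull(), c)).alias(c) for c in df.columns]
)
null_counts.show()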

2. Filtering NULL Values: If you want to filter out NULL values from a specific column, 
you can use the filter() or where() methods along with the isNotNull() method to retain 
only rows that contain non-NULL values.

Example: Filtering Out NULL Values.

# Filter out rows where 'Age' is NULL
df_filtered = df.filter(df["Age"].isNotNull())
# Show the result
df_filtered.show()
Output:

+---------+---+
|     Name|Age|
+---------+---+
|    Alice| 25|
|Catherine| 29|
|     null| 22|
+---------+---+
Here, all rows with a NULL value in the Age column are removed, and only rows with a non-NULL Age remain. Note that the row with a NULL Name is kept, because only the Age column was checked.
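The where() method is an alias of filter() and also accepts a SQL-style condition string. A small sketch of the equivalent filter, assuming the same df:

# Equivalent filter using where() with a SQL expression
df_filtered_sql = df.where("Age IS NOT NULL")
# Show the result
df_filtered_sql.show()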

3. Dropping NULL Values: PySpark provides the dropna() method to drop rows (or columns) 
with NULL values. You can customize the behavior of dropna() based on how many NULL values 
you want to tolerate.

Syntax: DataFrame.dropna(how='any', thresh=None, subset=None)
Parameters:
how: Can be 'any' (default) or 'all'. If set to 'any', rows containing at least one NULL value are dropped. 
If set to 'all', only rows in which every value is NULL are dropped.
thresh: Specifies the minimum number of non-NULL values required to retain a row; if specified, it overrides how.
subset: Specifies the columns to check for NULL values.
Example 1: Dropping Rows with Any NULL Values.

# Drop rows where any column has NULL values
df_dropped_any = df.dropna()
# Show the result
df_dropped_any.show()
Output:

+---------+---+
|     Name|Age|
+---------+---+
|    Alice| 25|
|Catherine| 29|
+---------+---+
In this example, both rows that contain a NULL value (Bob's row, where Age is NULL, and the row where Name is NULL) are removed.

Example 2: Dropping Rows with NULLs in Specific Columns: If you only want to drop rows 
with NULL values in specific columns, you can use the subset parameter.

# Drop rows where 'Age' is NULL
df_dropped_subset = df.dropna(subset=["Age"])
# Show the result
df_dropped_subset.show()
Output:

+---------+---+
|     Name|Age|
+---------+---+
|    Alice| 25|
|Catherine| 29|
|     null| 22|
+---------+---+
In this case, rows are dropped only if the Age column contains a NULL value.
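For completeness, here is a minimal sketch of the how='all' and thresh parameters described above, assuming the same df. With this sample data, how='all' drops nothing (no row is entirely NULL), while thresh=2 drops the two rows that have only one non-NULL value.

# Drop rows only when every column is NULL
df_dropped_all = df.dropna(how="all")
df_dropped_all.show()
# Keep only rows with at least 2 non-NULL values
df_dropped_thresh = df.dropna(thresh=2)
df_dropped_thresh.show()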

4. Filling NULL Values: Another common way to handle NULL values is by replacing them with 
a specific value using the fillna() method. This is useful when you want to fill missing 
values with a default or placeholder value, such as 0, an empty string, or the mean of the 
column.

Syntax: DataFrame.fillna(value, subset=None)
Parameters:
value: The value used to replace NULL values. Can be a scalar or a dictionary mapping columns 
to values.
subset: Optional. A list of column names where you want to fill NULL values.
Example 1: Filling NULL Values with a Scalar: You can fill all NULL values in a column with 
a single scalar value.

# Fill NULL values in 'Age' with 0
df_filled = df.fillna(0, subset=["Age"])
# Show the result
df_filled.show()
Output:

+---------+---+
|     Name|Age|
+---------+---+
|    Alice| 25|
|      Bob|  0|
|Catherine| 29|
|     null| 22|
+---------+---+
In this example, the NULL values in the Age column are replaced with 0.

Example 2: Filling NULL Values with Different Values per Column: You can also provide a 
dictionary to specify different fill values for different columns.

# Fill NULL values in 'Age' with 0 and 'Name' with 'Unknown'
df_filled_multiple = df.fillna({"Age": 0, "Name": "Unknown"})
# Show the result
df_filled_multiple.show()
Output:

+---------+---+
|     Name|Age|
+---------+---+
|    Alice| 25|
|      Bob|  0|
|Catherine| 29|
|  Unknown| 22|
+---------+---+
Here, NULL values in the Age column are filled with 0, and NULL values in the Name column 
are filled with “Unknown”.
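To fill with the mean of the column, as mentioned above, you can compute the average first and pass it to fillna(). A minimal sketch, assuming the same df; the mean is rounded to an integer here because Age was inferred as an integer column.

from pyspark.sql.functions import avg
# Compute the mean of 'Age' (avg ignores NULLs) and use it as the fill value
mean_age = df.select(avg("Age")).first()[0]
df_filled_mean = df.fillna({"Age": int(round(mean_age))})
# Show the result
df_filled_mean.show()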

5. Replacing NULL Values: In some cases, you may want to replace NULL values conditionally, 
for example based on non-NULL values in other columns or on a predefined condition. 
Note that na.replace() only substitutes existing non-NULL values; to replace NULLs conditionally, 
use when() and otherwise() from pyspark.sql.functions (a coalesce() sketch also appears at the end of this section).

Example: Replacing NULL Values.

from pyspark.sql.functions import when, col
# Replace NULL values in the 'Name' column with 'Anonymous'
df_replaced = df.withColumn(
    "Name", when(col("Name").isNull(), "Anonymous").otherwise(col("Name"))
)
# Show the result
df_replaced.show()
Output:

+---------+----+
|     Name| Age|
+---------+----+
|    Alice|  25|
|      Bob|null|
|Catherine|  29|
|Anonymous|  22|
+---------+----+
In this example, the NULL values in the Name column are replaced with the string “Anonymous”.
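If the replacement should come from another column rather than a constant, coalesce() returns the first non-NULL value across the columns you pass it. The sketch below uses a hypothetical DataFrame with a Nickname fallback column, purely for illustration.

from pyspark.sql.functions import coalesce, col, lit
# Hypothetical DataFrame with a 'Nickname' fallback column (illustration only)
people = spark.createDataFrame(
    [("Alice", "Ally"), (None, "Cat"), (None, None)],
    ["Name", "Nickname"],
)
# Use 'Name' when present, otherwise 'Nickname', otherwise a literal default
people = people.withColumn(
    "DisplayName", coalesce(col("Name"), col("Nickname"), lit("Anonymous"))
)
# Show the result
people.show()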
Conclusion: Handling NULL values in PySpark is essential for ensuring data quality and 
integrity in your data pipeline. PySpark provides powerful methods to detect, filter, drop, 
fill, and replace NULL values in DataFrames, enabling you to clean your data efficiently. 
Whether you’re preparing data for analysis, reporting, or machine learning models, 
handling NULL values correctly will improve the accuracy and performance of your operations.


