PySpark fillna() & fill() – Replace NULL Values

In PySpark, DataFrame.fillna() or DataFrameNaFunctions.fill() is used to replace NULL/None values in DataFrame columns with zero(0), an empty string, a space, or any constant literal value.

While working with a PySpark DataFrame we often need to replace null values, since certain operations on null values can fail at runtime; hence, handling nulls gracefully should be the first step before processing. Also, when writing to a file it is best practice to replace null values first, otherwise nulls end up in the output file.

As part of the cleanup, you may sometimes need to drop rows with NULL values in a PySpark DataFrame and filter rows by checking IS NULL/NOT NULL conditions.

In this article, I will use both fill() and fillna() to replace null values with an empty string, a constant value, and zero(0) on integer and string DataFrame columns, with Python examples.

Before we start, let’s read a CSV file into a PySpark DataFrame where certain rows have no values in the String and Integer columns; PySpark assigns null to these missing values.

The file used here is available on GitHub: small_zipcode.csv


from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

filePath = "resources/small_zipcode.csv"
df = spark.read.options(header='true', inferSchema='true') \
          .csv(filePath)

df.printSchema()
df.show(truncate=False)

This yields the output below. As you can see, the type, city, and population columns have null values.


+---+-------+--------+-------------------+-----+----------+
|id |zipcode|type    |city               |state|population|
+---+-------+--------+-------------------+-----+----------+
|1  |704    |STANDARD|null               |PR   |30100     |
|2  |704    |null    |PASEO COSTA DEL SUR|PR   |null      |
|3  |709    |null    |BDA SAN LUIS       |PR   |3700      |
|4  |76166  |UNIQUE  |CINGULAR WIRELESS  |TX   |84000     |
|5  |76177  |STANDARD|null               |TX   |null      |
+---+-------+--------+-------------------+-----+----------+

Now, let’s see how to replace these null values.

PySpark fillna() & fill() Syntax

PySpark provides DataFrame.fillna() and DataFrameNaFunctions.fill() to replace NULL/None values. These two are aliases of each other and return the same results.


fillna(value, subset=None)
fill(value, subset=None)
  • value – The value must be of type int, float, string, bool, or dict. The value specified here replaces the NULL/None values.
  • subset – Optional; when used, it should be a list of the column names in which to replace NULL/None values.

PySpark Replace NULL Values with Zero (0)

The fill() method available in DataFrameNaFunctions can be used to replace NULL values with zero(0) or any other numeric constant in all integer and long columns of a PySpark DataFrame.


df.na.fill(value=0).show()
df.na.fill(value=0,subset=["population"]).show()

Both statements above yield the same output below. Note that only integer columns are replaced, since the supplied value is the integer 0.


+---+-------+--------+-------------------+-----+----------+
|id |zipcode|type    |city               |state|population|
+---+-------+--------+-------------------+-----+----------+
|1  |704    |STANDARD|null               |PR   |30100     |
|2  |704    |null    |PASEO COSTA DEL SUR|PR   |0         |
|3  |709    |null    |BDA SAN LUIS       |PR   |3700      |
|4  |76166  |UNIQUE  |CINGULAR WIRELESS  |TX   |84000     |
|5  |76177  |STANDARD|null               |TX   |0         |
+---+-------+--------+-------------------+-----+----------+

PySpark Replace Null Values with Empty String

Now let’s see how to replace NULL/None values with an empty string or any constant string value on DataFrame columns.


df.na.fill("").show(truncate=False)

This yields the output below. It replaces NULL values with an empty/blank string in all String type columns.


+---+-------+--------+-------------------+-----+----------+
|id |zipcode|type    |city               |state|population|
+---+-------+--------+-------------------+-----+----------+
|1  |704    |STANDARD|                   |PR   |30100     |
|2  |704    |        |PASEO COSTA DEL SUR|PR   |null      |
|3  |709    |        |BDA SAN LUIS       |PR   |3700      |
|4  |76166  |UNIQUE  |CINGULAR WIRELESS  |TX   |84000     |
|5  |76177  |STANDARD|                   |TX   |null      |
+---+-------+--------+-------------------+-----+----------+

Now, let’s replace NULLs on specific columns. The example below replaces nulls in the type column with an empty string and in the city column with the value “unknown”.


df.na.fill("unknown", ["city"]) \
    .na.fill("", ["type"]).show()

This yields the output below: null values are replaced with an empty string in the type column and with the constant value “unknown” in the city column.


+---+-------+--------+-------------------+-----+----------+
|id |zipcode|type    |city               |state|population|
+---+-------+--------+-------------------+-----+----------+
|1  |704    |STANDARD|unknown            |PR   |30100     |
|2  |704    |        |PASEO COSTA DEL SUR|PR   |null      |
|3  |709    |        |BDA SAN LUIS       |PR   |3700      |
|4  |76166  |UNIQUE  |CINGULAR WIRELESS  |TX   |84000     |
|5  |76177  |STANDARD|unknown            |TX   |null      |
+---+-------+--------+-------------------+-----+----------+

Alternatively, you can write the above statement as:


df.na.fill({"city": "unknown", "type": ""}) \
    .show()

Complete Code

Below is the complete code with Python examples. You can copy it from here or download the source code from GitHub.


from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

filePath = "resources/small_zipcode.csv"
df = spark.read.options(header='true', inferSchema='true') \
          .csv(filePath)

df.printSchema()
df.show(truncate=False)


df.fillna(value=0).show()
df.fillna(value=0, subset=["population"]).show()
df.na.fill(value=0).show()
df.na.fill(value=0, subset=["population"]).show()


df.fillna(value="").show()
df.na.fill(value="").show()

df.fillna("unknown", ["city"]) \
    .fillna("", ["type"]).show()

df.fillna({"city": "unknown", "type": ""}) \
    .show()

df.na.fill("unknown", ["city"]) \
    .na.fill("", ["type"]).show()

df.na.fill({"city": "unknown", "type": ""}) \
    .show()

Conclusion

In this PySpark article, you learned how to replace null values with zero or an empty string on integer and string columns respectively, using the fill() and fillna() transformation functions.

Thanks for reading. If you found this article helpful, please leave a comment or any suggestions for improvement in the comments section!

Happy Learning !!

Author: Shantun Parmar
