Getting Started with PySpark UDFs (User Defined Functions)

Amit Soni · Published in Analytics Vidhya · Jan 4, 2021


What is a UDF?

A User Defined Function (UDF) is a custom function written to perform transformation operations on PySpark DataFrames. Once defined, it can be reused across multiple DataFrames. It can also serve as an alternative to for loops, often with much better performance.

Why is a UDF Needed?

UDFs can be used to perform data transformation operations that are not already available among PySpark's built-in functions. For instance, suppose we have a column of string values and need to create a new column with each string reversed. There is no built-in function for this, so we have to write a custom function.
UDFs can also be used as an alternative to for loops, since they are much faster thanks to parallel processing, whereas a for loop iterates step by step. For instance, say you have a DataFrame column with 100 rows of string values, and you need to store the reversed strings in a new column. There is no built-in function for this operation, so let's compare two ways to approach the problem.
1. For loop: iterate over all 100 rows one by one and perform the desired operation. Since the iteration executes step by step, it takes a long time to run.
2. UDF: define a custom function (UDF) to perform the operation. Since the operation on each of the 100 rows is independent of the others, the UDF can execute in parallel and finishes much faster than a for loop, as sketched below.
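As a quick illustration, here is a minimal sketch of the UDF approach for the reverse-string example. It assumes an active SparkSession named spark; the DataFrame and column names are hypothetical.

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Hypothetical single-column DataFrame of strings.
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# The UDF reverses each string; Spark applies it to rows in parallel.
reverse_udf = udf(lambda s: s[::-1], StringType())

df.withColumn("name_reversed", reverse_udf(df["name"])).show()
```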

Creating a UDF in PySpark

Consider a DataFrame column containing a customer's first and last name, from which we need to create a new column with the first letter of each word converted to upper case. Let us walk through building this new column with a UDF.

Create a PySpark DataFrame
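A minimal sketch of the setup might look like this; the customer_name column and the sample values are assumptions for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("udf-demo").getOrCreate()

# Hypothetical customer data: one column holding "first last" names.
df = spark.createDataFrame(
    [("john doe",), ("alice smith",), ("bob lee",)],
    ["customer_name"],
)
df.show()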

Create a Custom Function

Let's create a custom function that takes the customer name and returns it with the first letter of each word converted to upper case.
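A plain Python function along these lines would do; the name convert_case is an assumption.

```python
def convert_case(name):
    # Upper-case the first letter of each word: "john doe" -> "John Doe".
    return " ".join(word[:1].upper() + word[1:] for word in name.split(" "))
```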

Register a PySpark UDF

Create a PySpark UDF by using the pyspark.sql.functions.udf() function. It takes two arguments: the custom function and the return data type, i.e. the data type of the value returned by the custom function (StringType() by default).
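Registering the function above as a UDF is a one-liner:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Wrap the plain Python function in a UDF. StringType() is the default
# return type, but passing it explicitly makes the intent clear.
convert_case_udf = udf(convert_case, StringType())
```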

Calling a PySpark UDF

Finally, create a new column and perform the desired transformation by calling the UDF. (The withColumn() function is used to create a new column or transform an existing column's values.)
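Continuing the sketch above (the output column name is an assumption):

```python
# withColumn adds a new column (or replaces one with the same name).
df = df.withColumn("customer_name_cased", convert_case_udf(df["customer_name"]))
df.show()
```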

We now have a new column with the desired result. This way you can perform any custom operation by defining a UDF. Let's look at some more interesting use cases.

More use cases of PySpark UDF

Let us look at a few more cases where UDFs can be very handy.

Timezone Conversion

Consider a scenario where a DataFrame has two columns: one with a local timestamp and the other with the local time zone.
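For example, a hypothetical DataFrame like the following, with a naive timestamp string and an IANA time zone name:

```python
# Hypothetical data: a local timestamp string and its IANA time zone.
tz_df = spark.createDataFrame(
    [
        ("2021-01-04 10:30:00", "America/New_York"),
        ("2021-01-04 10:30:00", "Asia/Kolkata"),
    ],
    ["local_timestamp", "timezone"],
)
```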

Let’s create a custom function to convert the timestamp to UTC according to the time zone.
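One way to sketch this, assuming the pytz library is available and the column names from the hypothetical DataFrame above:

```python
from datetime import datetime

import pytz
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def to_utc(ts, tz):
    # Interpret the naive timestamp in its local zone, then convert to UTC.
    local = pytz.timezone(tz).localize(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"))
    return local.astimezone(pytz.utc).strftime("%Y-%m-%d %H:%M:%S")

to_utc_udf = udf(to_utc, StringType())

# Both columns are passed to the UDF as arguments.
tz_df = tz_df.withColumn(
    "utc_timestamp",
    to_utc_udf(tz_df["local_timestamp"], tz_df["timezone"]),
)
tz_df.show(truncate=False)
```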

Note that in the example above we passed more than one column as arguments to the UDF. This way we can create a new column from multiple input columns.

Imputing Missing Values with Mean

Consider a DataFrame with some missing values in its columns. We need to impute the missing values with the mean value of each column.

In the examples so far, we created or updated one column at a time using a UDF. Since we now need to impute the null values in every column with that column's mean, we will create a UDF and run it over every column of the DataFrame, as sketched below.
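A minimal sketch of this pattern, assuming a hypothetical numeric DataFrame with columns col_a and col_b: the mean of each column is computed first with the built-in mean() aggregate, and the UDF only substitutes that precomputed value for nulls.

```python
from pyspark.sql.functions import mean, udf
from pyspark.sql.types import DoubleType

# Hypothetical numeric DataFrame with missing values.
num_df = spark.createDataFrame(
    [(1.0, None), (2.0, 4.0), (None, 6.0)],
    "col_a double, col_b double",
)

for col_name in num_df.columns:
    # mean() ignores nulls, so this is the mean of the present values.
    col_mean = num_df.agg(mean(col_name)).first()[0]

    # Capture the precomputed mean as a default argument so that each
    # column gets its own value inside the loop.
    impute = udf(lambda v, m=col_mean: m if v is None else v, DoubleType())
    num_df = num_df.withColumn(col_name, impute(num_df[col_name]))

num_df.show()
```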

The NULL values of every column get imputed with the mean value of that column.

Conclusion

In this article we learned the following:
1. UDFs can be very handy when we need to perform a custom transformation on a PySpark DataFrame.
2. Once defined, a UDF can be reused with multiple DataFrames.
3. UDFs are faster than for loops for row-wise operations, since they run in parallel.
