theCodingInterface

How to Unnest Multi-Valued Array Fields in PySpark using Explode

By Adam McQuistan in Data Engineering 04/30/2021 Comment

Introduction

In this How To article I will show a simple example of how to use the explode function from the SparkSQL API to unravel multi-valued fields. I have found this to be a pretty common use case when doing data cleaning using PySpark, particularly when working with nested JSON documents in an Extract Transform and Load workflow.

Understanding the Data Problem

First things first I want to make sure everyone understands the type of data structure I'm talking about here so, I'll concoct this dummy dataset representing parents by name and state of residence along with a multi-valued array field of children objects.

parents = [
    (
        "John",
        "NE",
        [
            {"child": "Jimmy", "dob": "2010-10-12", "pet": "dog"},
            {"child": "Billy", "dob": "2012-09-07"}
        ]
    ),
    (
        "Jane",
        "IA",
        [
            {"child": "Sally", "dob": "2008-08-19"},
            {"child": "Tim", "dob": "2013-09-15"}
        ]
    ),
    (
        "Sue",
        "IA",
        [
            {"child": "Cameron", "dob": "2009-11-21", "pet": "cat"}
        ]
    ),
]

I'll leave the data as standard Python types as opposed to persisting it as JSON then reading it back into a PySpark DataFrame just for ease of reproducibility and expressiveness.

Next I will generate a schema as well as a DataFrame constructed from the schema and parents data then print out the schema to verify the data structure.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

spark = SparkSession.builder.appName('SparkNestedFields').getOrCreate()

schema = StructType([
    StructField("parent", StringType()),
    StructField("state", StringType()),
    StructField("children", ArrayType(
        StructType([
            StructField("child", StringType()),
            StructField("dob", StringType()),
            StructField("pet", StringType())
        ])
    ))
])

parents_df = spark.createDataFrame(data=parents, schema=schema)

parents_df.printSchema()

root
 |-- parent: string (nullable = true)
 |-- state: string (nullable = true)
 |-- children: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- child: string (nullable = true)
 |    |    |-- dob: string (nullable = true)
 |    |    |-- pet: string (nullable = true)

I'll also go ahead and show the DataFrame contents to be sure everyone is on the same page.

parents_df.show(truncate=False)

+------+-----+------------------------------------------------+
|parent|state|children                                        |
+------+-----+------------------------------------------------+
|John  |NE   |[[Jimmy, 2010-10-12, dog], [Billy, 2012-09-07,]]|
|Jane  |IA   |[[Sally, 2008-08-19,], [Tim, 2013-09-15,]]      |
|Sue   |IA   |[[Cameron, 2009-11-21, cat]]                    |
+------+-----+------------------------------------------------+

Using the explode Function to Unravel the Nested Field

Alright, so everyone should now be clear on the type and structure of the source data. What I'd like to do is unravel that children field so that I end up with an expanded DataFrame with the columns parent, state, child, dob, and pet. To accomplish this I'll use the explode(...) function, which for me, the syntax from the docs had proven to be a bit hard for me to wrap my head around at first. Therefore, I'm hoping this example will help others with a similar nested field problem who stumble upon the docs only to find themselves scratching their heads and waving their fists in frusteration like I did.

To start I will simply use the explode(...) function with the withColumn(...) function of the DataFrame to explode out the children field to a new struct column "children_exploded" which results in a new DataFrame where each nested struct in the original children array field becomes a new row.

children_exploded = parents_df.withColumn("children_exploded", F.explode("children"))
children_exploded.show(truncate=False)

+------+-----+------------------------------------------------+--------------------------+
|parent|state|children                                        |children_exploded         |
+------+-----+------------------------------------------------+--------------------------+
|John  |NE   |[[Jimmy, 2010-10-12, dog], [Billy, 2012-09-07,]]|[Jimmy, 2010-10-12, dog]  |
|John  |NE   |[[Jimmy, 2010-10-12, dog], [Billy, 2012-09-07,]]|[Billy, 2012-09-07,]      |
|Jane  |IA   |[[Sally, 2008-08-19,], [Tim, 2013-09-15,]]      |[Sally, 2008-08-19,]      |
|Jane  |IA   |[[Sally, 2008-08-19,], [Tim, 2013-09-15,]]      |[Tim, 2013-09-15,]        |
|Sue   |IA   |[[Cameron, 2009-11-21, cat]]                    |[Cameron, 2009-11-21, cat]|
+------+-----+------------------------------------------------+--------------------------+

And its always helpful to take a look at the schema also.

children_exploded.printSchema()

root
 |-- parent: string (nullable = true)
 |-- state: string (nullable = true)
 |-- children: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- child: string (nullable = true)
 |    |    |-- dob: string (nullable = true)
 |    |    |-- pet: string (nullable = true)
 |-- children_exploded: struct (nullable = true)
 |    |-- child: string (nullable = true)
 |    |-- dob: string (nullable = true)
 |    |-- pet: string (nullable = true)

Now I can select the fields of the children_exploded struct column into their own columns in the DataFrame like shown below and viola, my job is done!

children_exploded = children_exploded.select(
                                "parent",
                                "state",
                                "children_exploded.child",
                                "children_exploded.dob",
                                "children_exploded.pet")
children_exploded.show(truncate=False)

+------+-----+-------+----------+----+
|parent|state|child  |dob       |pet |
+------+-----+-------+----------+----+
|John  |NE   |Jimmy  |2010-10-12|dog |
|John  |NE   |Billy  |2012-09-07|null|
|Jane  |IA   |Sally  |2008-08-19|null|
|Jane  |IA   |Tim    |2013-09-15|null|
|Sue   |IA   |Cameron|2009-11-21|cat |
+------+-----+-------+----------+----+

I'll show this once more as a succinct chained set of operations as well.

parents_df.withColumn("children_exploded", F.explode("children"))\
            .select("parent",
                    "state",
                    "children_exploded.child",
                    "children_exploded.dob",
                    "children_exploded.pet")\
            .show(truncate=False)

+------+-----+-------+----------+----+
|parent|state|child  |dob       |pet |
+------+-----+-------+----------+----+
|John  |NE   |Jimmy  |2010-10-12|dog |
|John  |NE   |Billy  |2012-09-07|null|
|Jane  |IA   |Sally  |2008-08-19|null|
|Jane  |IA   |Tim    |2013-09-15|null|
|Sue   |IA   |Cameron|2009-11-21|cat |
+------+-----+-------+----------+----+

Resources to Learn More About High Performance Spark Computing

A fantastic book on Modern Spark computing and one that has been instrumental in developing my understanding of programming in Spark is Learning Spark: Lightening-Fast Data Analytics
If you are looking for an introductory article on PySpark or for some more examples to add to your PySpark toolset please have a look at my article Example Driven High Level Overview of Spark with Python

thecodinginterface.com earns commision from sales of linked products such as the book suggestions above. This enables providing high quality, frequent, and most importantly FREE tutorials and content for readers interested in Software Engineering so, thank you for supporting the authors of these resources as well as thecodinginterface.com

Conclusion

In this short How To article I demonstrate the syntax and usefulness of the PySpark explode(...) function from the SparkSQL API for unraveling nested multi-valued fields. If you are looking for an introductory article on PySpark or for some more examples to add to your PySpark toolset please have a look at my article Example Driven High Level Overview of Spark with Python

As always, thanks for reading and please do not hesitate to critique or comment below.

PySpark Data Engineering Python

Share with friends and colleagues

[[ likes ]] likes