In this How To article I will show a simple example of how to use the explode function from the SparkSQL API to unravel multi-valued fields. I have found this to be a pretty common use case when doing data cleaning with PySpark, particularly when working with nested JSON documents in an Extract, Transform, and Load (ETL) workflow.
First things first, I want to make sure everyone understands the type of data structure I'm talking about here, so I'll concoct a dummy dataset representing parents by name and state of residence, along with a multi-valued array field of children objects.
parents = [
(
"John",
"NE",
[
{"child": "Jimmy", "dob": "2010-10-12", "pet": "dog"},
{"child": "Billy", "dob": "2012-09-07"}
]
),
(
"Jane",
"IA",
[
{"child": "Sally", "dob": "2008-08-19"},
{"child": "Tim", "dob": "2013-09-15"}
]
),
(
"Sue",
"IA",
[
{"child": "Cameron", "dob": "2009-11-21", "pet": "cat"}
]
),
]
I'll leave the data as standard Python types, as opposed to persisting it as JSON and then reading it back into a PySpark DataFrame, purely for ease of reproducibility and expressiveness.
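That said, if you did want to go the JSON route, a minimal sketch might look like the following. Note this is just an illustration: the parents.json file name is hypothetical, it relies on the spark session created in the next snippet, and spark.read.json would infer its own schema rather than use the explicit one defined below.

import json

# hypothetical sketch: persist the parents data as newline-delimited JSON
# (one object per line, which is what spark.read.json expects by default),
# then read it back with the DataFrame reader
with open("parents.json", "w") as f:
    for name, state, children in parents:
        f.write(json.dumps({"parent": name, "state": state, "children": children}) + "\n")

parents_from_json_df = spark.read.json("parents.json")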
Next I will define a schema, construct a DataFrame from the schema and the parents data, and print out the schema to verify the data structure.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, ArrayType
spark = SparkSession.builder.appName('SparkNestedFields').getOrCreate()
schema = StructType([
StructField("parent", StringType()),
StructField("state", StringType()),
StructField("children", ArrayType(
StructType([
StructField("child", StringType()),
StructField("dob", StringType()),
StructField("pet", StringType())
])
))
])
parents_df = spark.createDataFrame(data=parents, schema=schema)
parents_df.printSchema()
root
 |-- parent: string (nullable = true)
 |-- state: string (nullable = true)
 |-- children: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- child: string (nullable = true)
 |    |    |-- dob: string (nullable = true)
 |    |    |-- pet: string (nullable = true)
I'll also go ahead and show the DataFrame contents to be sure everyone is on the same page.
parents_df.show(truncate=False)
+------+-----+------------------------------------------------+
|parent|state|children                                        |
+------+-----+------------------------------------------------+
|John  |NE   |[[Jimmy, 2010-10-12, dog], [Billy, 2012-09-07,]]|
|Jane  |IA   |[[Sally, 2008-08-19,], [Tim, 2013-09-15,]]      |
|Sue   |IA   |[[Cameron, 2009-11-21, cat]]                    |
+------+-----+------------------------------------------------+
Alright, so everyone should now be clear on the type and structure of the source data. What I'd like to do is unravel the children field so that I end up with an expanded DataFrame with the columns parent, state, child, dob, and pet. To accomplish this I'll use the explode(...) function, whose syntax in the docs proved a bit hard for me to wrap my head around at first. I'm hoping this example will help others with a similar nested field problem who stumble upon the docs only to find themselves scratching their heads and waving their fists in frustration like I did.
To start, I will use the explode(...) function together with the DataFrame's withColumn(...) method to expand the children field into a new struct column named "children_exploded", which results in a new DataFrame where each nested struct in the original children array becomes its own row.
children_exploded = parents_df.withColumn("children_exploded", F.explode("children"))
children_exploded.show(truncate=False)
+------+-----+------------------------------------------------+--------------------------+
|parent|state|children                                        |children_exploded         |
+------+-----+------------------------------------------------+--------------------------+
|John  |NE   |[[Jimmy, 2010-10-12, dog], [Billy, 2012-09-07,]]|[Jimmy, 2010-10-12, dog]  |
|John  |NE   |[[Jimmy, 2010-10-12, dog], [Billy, 2012-09-07,]]|[Billy, 2012-09-07,]      |
|Jane  |IA   |[[Sally, 2008-08-19,], [Tim, 2013-09-15,]]      |[Sally, 2008-08-19,]      |
|Jane  |IA   |[[Sally, 2008-08-19,], [Tim, 2013-09-15,]]      |[Tim, 2013-09-15,]        |
|Sue   |IA   |[[Cameron, 2009-11-21, cat]]                    |[Cameron, 2009-11-21, cat]|
+------+-----+------------------------------------------------+--------------------------+
And it's always helpful to take a look at the schema as well.
children_exploded.printSchema()
root
 |-- parent: string (nullable = true)
 |-- state: string (nullable = true)
 |-- children: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- child: string (nullable = true)
 |    |    |-- dob: string (nullable = true)
 |    |    |-- pet: string (nullable = true)
 |-- children_exploded: struct (nullable = true)
 |    |-- child: string (nullable = true)
 |    |-- dob: string (nullable = true)
 |    |-- pet: string (nullable = true)
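One caveat worth noting before moving on: explode(...) produces no output rows for a record whose array is null or empty, so such records silently disappear from the result. If you need to keep them, SparkSQL also provides explode_outer(...), which emits a single row with a null struct instead. A quick sketch, using a hypothetical childless parent named Pat that is not part of the original dataset:

# hypothetical example: a parent with an empty children array
childless = spark.createDataFrame(data=[("Pat", "KS", [])], schema=schema)

# with F.explode("children"), Pat would produce no rows at all;
# F.explode_outer keeps a single row with a null children_exploded struct
childless.withColumn("children_exploded", F.explode_outer("children")).show()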
Now I can select the fields of the children_exploded struct column into their own columns in the DataFrame, as shown below, and voilà, my job is done!
children_exploded = children_exploded.select(
"parent",
"state",
"children_exploded.child",
"children_exploded.dob",
"children_exploded.pet")
children_exploded.show(truncate=False)
+------+-----+-------+----------+----+
|parent|state|child  |dob       |pet |
+------+-----+-------+----------+----+
|John  |NE   |Jimmy  |2010-10-12|dog |
|John  |NE   |Billy  |2012-09-07|null|
|Jane  |IA   |Sally  |2008-08-19|null|
|Jane  |IA   |Tim    |2013-09-15|null|
|Sue   |IA   |Cameron|2009-11-21|cat |
+------+-----+-------+----------+----+
I'll show this once more as a succinct chained set of operations as well.
parents_df.withColumn("children_exploded", F.explode("children"))\
.select("parent",
"state",
"children_exploded.child",
"children_exploded.dob",
"children_exploded.pet")\
.show(truncate=False)
+------+-----+-------+----------+----+
|parent|state|child  |dob       |pet |
+------+-----+-------+----------+----+
|John  |NE   |Jimmy  |2010-10-12|dog |
|John  |NE   |Billy  |2012-09-07|null|
|Jane  |IA   |Sally  |2008-08-19|null|
|Jane  |IA   |Tim    |2013-09-15|null|
|Sue   |IA   |Cameron|2009-11-21|cat |
+------+-----+-------+----------+----+
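One last shorthand worth mentioning: instead of listing each field of the struct individually in the select, the * qualifier expands all of a struct column's fields at once, producing the same result as above.

# equivalent result: the * qualifier expands every field of the exploded struct
parents_df.withColumn("children_exploded", F.explode("children"))\
    .select("parent", "state", "children_exploded.*")\
    .show(truncate=False)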
In this short How To article I demonstrated the syntax and usefulness of the PySpark explode(...) function from the SparkSQL API for unraveling nested multi-valued fields. If you are looking for an introductory article on PySpark, or for some more examples to add to your PySpark toolset, please have a look at my article Example Driven High Level Overview of Spark with Python.
As always, thanks for reading and please do not hesitate to critique or comment below.