Here is my code:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

department1 = Row(id='123456', name='Computer Science')
department2 = Row(id='789012', name='Mechanical Engineering')

Employee = Row("firstName", "lastName", "email", "salary")
employee1 = Employee('michael', 'armbrust', '[email protected]', 100000)
employee2 = Employee('xiangrui', 'meng', '[email protected]', 120000)

departmentWithEmployees1 = Row(department=department1, employees=[employee1, employee2])
departmentWithEmployees2 = Row(department=department2, employees=[employee1, employee2])

departmentsWithEmployeesSeq1 = [departmentWithEmployees1, departmentWithEmployees2]
df1 = spark.createDataFrame(departmentsWithEmployeesSeq1)

I want to concatenate firstName and lastName for each employee struct inside the array:

from pyspark.sql import functions as sf
df2 = df1.withColumn("employees.FullName", sf.concat(sf.col('employees.firstName'), sf.col('employees.lastName')))
df2.printSchema()

root
 |-- department: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
 |    |    |-- email: string (nullable = true)
 |    |    |-- salary: long (nullable = true)
 |-- employees.FullName: array (nullable = true)
 |    |-- element: string (containsNull = true)

My new FullName column ends up at the parent level (selecting employees.firstName from an array of structs returns an array of strings, so concat produces a separate top-level array column). How do I put FullName inside each struct in the array, like this:

root
 |-- department: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
 |    |    |-- email: string (nullable = true)
 |    |    |-- salary: long (nullable = true)
 |    |    |-- FullName: string (nullable = true)

1 Answer


One way to do this is to explode your array of structs with inline_outer, use concat_ws to build the full name, and then reassemble the columns with array and struct.

from pyspark.sql import functions as F

# inline_outer explodes the array of structs into one row per employee,
# promoting firstName, lastName, email and salary to top-level columns
df1.selectExpr("department", "inline_outer(employees)") \
   .withColumn("FullName", F.concat_ws(" ", "firstName", "lastName")) \
   .select("department",
           # rebuild each struct (now including FullName) and wrap it in an array
           F.array(F.struct(*[F.col(x) for x in
                              ['firstName', 'lastName', 'email', 'salary', 'FullName']]))
            .alias("employees")) \
   .printSchema()

#root
# |-- department: struct (nullable = true)
# |    |-- id: string (nullable = true)
# |    |-- name: string (nullable = true)
# |-- employees: array (nullable = false)
# |    |-- element: struct (containsNull = false)
# |    |    |-- firstName: string (nullable = true)
# |    |    |-- lastName: string (nullable = true)
# |    |    |-- email: string (nullable = true)
# |    |    |-- salary: long (nullable = true)
# |    |    |-- FullName: string (nullable = false)
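
One caveat with the snippet above: inline_outer leaves one row per employee, so each rebuilt employees array holds a single struct. If the original one-row-per-department shape matters, a sketch (mine, not from the answer above) that collects the structs back per department with groupBy and collect_list:

from pyspark.sql import functions as F

# explode, build FullName, then gather the structs back per department
df2 = (df1.selectExpr("department", "inline_outer(employees)")
          .withColumn("FullName", F.concat_ws(" ", "firstName", "lastName"))
          .groupBy("department")
          .agg(F.collect_list(
                   F.struct("firstName", "lastName", "email", "salary", "FullName"))
                .alias("employees")))
df2.printSchema()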
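Alternatively, on Spark 2.4+ you can skip the explode/reassemble round trip entirely and rewrite each struct in place with the transform higher-order function, a different technique from the inline_outer approach above. A minimal sketch against the df1 from the question:

from pyspark.sql import functions as F

# transform() maps over the employees array and rebuilds every struct
# with an extra FullName field; nothing is exploded, so each department
# keeps its full employees array in a single row
df3 = df1.withColumn(
    "employees",
    F.expr("""
        transform(employees, e -> struct(
            e.firstName, e.lastName, e.email, e.salary,
            concat_ws(' ', e.firstName, e.lastName) AS FullName))
    """))
df3.printSchema()

On Spark 3.1+ the same rewrite is also available natively in the Python API via F.transform together with Column.withField.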

2 Comments

Thanks for your quick response @murtihash. Can you suggest a good resource for building and practising this kind of complex query?
@PrakashKumar There's no short and easy resource for getting well versed in this beyond practising every day. The way I did it was by answering questions on here; you can use a free Databricks Community Edition cluster to prototype your code and practise, and also follow the top PySpark contributors and their answers.
