Here is my code:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

department1 = Row(id='123456', name='Computer Science')
department2 = Row(id='789012', name='Mechanical Engineering')

Employee = Row("firstName", "lastName", "email", "salary")
employee1 = Employee('michael', 'armbrust', '[email protected]', 100000)
employee2 = Employee('xiangrui', 'meng', '[email protected]', 120000)

departmentWithEmployees1 = Row(department=department1, employees=[employee1, employee2])
departmentWithEmployees2 = Row(department=department2, employees=[employee1, employee2])

departmentsWithEmployeesSeq1 = [departmentWithEmployees1, departmentWithEmployees2]
df1 = spark.createDataFrame(departmentsWithEmployeesSeq1)

I want to concatenate firstName and lastName for each employee struct inside the array:

from pyspark.sql import functions as sf
df2 = df1.withColumn("employees.FullName", sf.concat(sf.col('employees.firstName'), sf.col('employees.lastName')))
df2.printSchema()

root
 |-- department: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
 |    |    |-- email: string (nullable = true)
 |    |    |-- salary: long (nullable = true)
 |-- employees.FullName: array (nullable = true)
 |    |-- element: string (containsNull = true)

My new FullName column ends up at the parent level (selecting employees.firstName from an array of structs returns an array of strings, so concat produces a separate top-level array column). How do I put FullName inside each struct in the array, like this:

root
 |-- department: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
 |    |    |-- email: string (nullable = true)
 |    |    |-- salary: long (nullable = true)
 |    |    |-- FullName: string (nullable = true)

1 Answer


One way to do this is to explode your array of structs with inline_outer, use concat_ws to build the full name, and then reassemble the columns with array and struct.

from pyspark.sql import functions as F

# inline_outer explodes the array of structs into one row per employee,
# promoting firstName, lastName, email and salary to top-level columns
df1.selectExpr("department", "inline_outer(employees)") \
   .withColumn("FullName", F.concat_ws(" ", "firstName", "lastName")) \
   .select("department",
           # rebuild each struct (now including FullName) and wrap it in an array
           F.array(F.struct(*[F.col(x) for x in
                              ['firstName', 'lastName', 'email', 'salary', 'FullName']]))
            .alias("employees")) \
   .printSchema()

#root
# |-- department: struct (nullable = true)
# |    |-- id: string (nullable = true)
# |    |-- name: string (nullable = true)
# |-- employees: array (nullable = false)
# |    |-- element: struct (containsNull = false)
# |    |    |-- firstName: string (nullable = true)
# |    |    |-- lastName: string (nullable = true)
# |    |    |-- email: string (nullable = true)
# |    |    |-- salary: long (nullable = true)
# |    |    |-- FullName: string (nullable = false)
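
One caveat with the snippet above: inline_outer leaves one row per employee, so each rebuilt employees array holds a single struct. If the original one-row-per-department shape matters, a sketch (mine, not from the answer above) that collects the structs back per department with groupBy and collect_list:

from pyspark.sql import functions as F

# explode, build FullName, then gather the structs back per department
df2 = (df1.selectExpr("department", "inline_outer(employees)")
          .withColumn("FullName", F.concat_ws(" ", "firstName", "lastName"))
          .groupBy("department")
          .agg(F.collect_list(
                   F.struct("firstName", "lastName", "email", "salary", "FullName"))
                .alias("employees")))
df2.printSchema()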
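Alternatively, on Spark 2.4+ you can skip the explode/reassemble round trip entirely and rewrite each struct in place with the transform higher-order function, a different technique from the inline_outer approach above. A minimal sketch against the df1 from the question:

from pyspark.sql import functions as F

# transform() maps over the employees array and rebuilds every struct
# with an extra FullName field; nothing is exploded, so each department
# keeps its full employees array in a single row
df3 = df1.withColumn(
    "employees",
    F.expr("""
        transform(employees, e -> struct(
            e.firstName, e.lastName, e.email, e.salary,
            concat_ws(' ', e.firstName, e.lastName) AS FullName))
    """))
df3.printSchema()

On Spark 3.1+ the same rewrite is also available natively in the Python API via F.transform together with Column.withField.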

2 Comments

Thanks for your quick response @murtihash. Can you suggest a good resource for building and practising this kind of complex query?
@PrakashKumar There's no short and easy resource for getting well versed in this beyond practising every day. The way I did it was by answering questions on here; you can use a free Databricks Community Edition cluster to prototype your code and practise, and also follow the top PySpark contributors and their answers.
