
I have a Spark dataframe:

id objects
1 [sun, solar system, mars, milky way]
2 [moon, cosmic rays, orion nebula]

I need to replace space with underscore in array elements.

Expected result:

id objects concat_obj
1 [sun, solar system, mars, milky way] [sun, solar_system, mars, milky_way]
2 [moon, cosmic rays, orion nebula] [moon, cosmic_rays, orion_nebula]

I tried using regexp_replace:

from pyspark.sql.functions import regexp_replace

df = df.withColumn('concat_obj', regexp_replace('objects', ' ', '_'))

but that changed all spaces to underscores while I need to replace spaces only inside array elements.
So, how can this be done in PySpark?


2 Answers


Use higher-order functions (available in Spark SQL 2.4+) to replace the whitespace in each element through regexp_replace.

schema

root
 |-- id: long (nullable = true)
 |-- objects: array (nullable = true)
 |    |-- element: string (containsNull = true)

solution

from pyspark.sql.functions import expr

df.withColumn('concat_obj', expr("transform(objects, x -> regexp_replace(x, ' ', '_'))")).show(truncate=False)

+---+------------------------------------+------------------------------------+
|id |objects                             |concat_obj                          |
+---+------------------------------------+------------------------------------+
|1  |[sun, solar system, mars, milky way]|[sun, solar_system, mars, milky_way]|
|2  |[moon, cosmic rays, orion nebula]   |[moon, cosmic_rays, orion_nebula]   |
+---+------------------------------------+------------------------------------+
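Here `transform` runs the `regexp_replace` lambda on each array element independently, leaving the array structure intact. As a sanity check, the same element-wise logic can be simulated in plain Python (the sample rows below just mirror the question's dataframe):

```python
import re

# Sample rows mirroring the question's dataframe: (id, objects)
rows = [
    (1, ["sun", "solar system", "mars", "milky way"]),
    (2, ["moon", "cosmic rays", "orion nebula"]),
]

# transform(objects, x -> regexp_replace(x, ' ', '_')) applies the
# replacement per element, so spaces inside each string become
# underscores while the array itself is preserved.
result = [
    (row_id, objects, [re.sub(" ", "_", x) for x in objects])
    for row_id, objects in rows
]

for row_id, objects, concat_obj in result:
    print(row_id, concat_obj)
```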

You could use the following regex:

`(?<=[A-Za-z]) `

The only difference from your code is that this pattern matches a space only when it is preceded by an alphabetical character.
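To illustrate the lookbehind with plain Python's `re` on a joined string: the separator spaces are preserved because each follows a comma, not a letter, while the spaces inside the element names are replaced.

```python
import re

# Array joined into one string, as in the question's failed attempt
joined = "sun, solar system, mars, milky way"

# '(?<=[A-Za-z]) ' matches a space only when the preceding
# character is a letter, so ", " separators are left alone.
fixed = re.sub(r"(?<=[A-Za-z]) ", "_", joined)
print(fixed)
```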


3 Comments

I got the following error: ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 246, Column 1: Assignment conversion not possible from type "org.apache.spark.sql.catalyst.util.ArrayData" to type "org.apache.spark.unsafe.types.UTF8String"
If you can provide a debugging environment with your code, I may be able to help further; I have no way of playing with PySpark at the moment. @red_quark
At the moment I've solved the problem a different way, by converting the array to a string and applying regexp_replace. But for the future, I'm still interested in how to get the desired result without first converting the array to a string.
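The workaround described in the comment above (join the array into a string, replace, split back) can be sketched element-wise in plain Python; in PySpark the analogous steps would use `concat_ws`, `regexp_replace`, and `split`. The `|` delimiter here is an assumption, chosen because it cannot occur inside the sample elements:

```python
import re

objects = ["moon", "cosmic rays", "orion nebula"]

# Join with a delimiter that never appears inside an element,
# replace every space, then split back into a list.
joined = "|".join(objects)
restored = re.sub(" ", "_", joined).split("|")
print(restored)
```

Note this only works if the delimiter is guaranteed absent from the data, which is why the higher-order `transform` approach is safer.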
