
I have a Spark dataframe:

id objects
1 [sun, solar system, mars, milky way]
2 [moon, cosmic rays, orion nebula]

I need to replace space with underscore in array elements.

Expected result:

id objects concat_obj
1 [sun, solar system, mars, milky way] [sun, solar_system, mars, milky_way]
2 [moon, cosmic rays, orion nebula] [moon, cosmic_rays, orion_nebula]

I tried using regexp_replace:

from pyspark.sql.functions import regexp_replace

df = df.withColumn('concat_obj', regexp_replace('objects', ' ', '_'))

but that changed all spaces to underscores while I need to replace spaces only inside array elements.
So, how can this be done in PySpark?


2 Answers


Use higher-order functions (available in Spark SQL 2.4+) to replace the whitespace in each element through regexp_replace.

schema

root
 |-- id: long (nullable = true)
 |-- objects: array (nullable = true)
 |    |-- element: string (containsNull = true)

solution

from pyspark.sql.functions import expr

df.withColumn('concat_obj', expr("transform(objects, x -> regexp_replace(x, ' ', '_'))")).show(truncate=False)

+---+------------------------------------+------------------------------------+
|id |objects                             |concat_obj                          |
+---+------------------------------------+------------------------------------+
|1  |[sun, solar system, mars, milky way]|[sun, solar_system, mars, milky_way]|
|2  |[moon, cosmic rays, orion nebula]   |[moon, cosmic_rays, orion_nebula]   |
+---+------------------------------------+------------------------------------+
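Here `transform` runs the `regexp_replace` lambda on each array element independently, leaving the array structure intact. As a sanity check, the same element-wise logic can be simulated in plain Python (the sample rows below just mirror the question's dataframe):

```python
import re

# Sample rows mirroring the question's dataframe: (id, objects)
rows = [
    (1, ["sun", "solar system", "mars", "milky way"]),
    (2, ["moon", "cosmic rays", "orion nebula"]),
]

# transform(objects, x -> regexp_replace(x, ' ', '_')) applies the
# replacement per element, so spaces inside each string become
# underscores while the array itself is preserved.
result = [
    (row_id, objects, [re.sub(" ", "_", x) for x in objects])
    for row_id, objects in rows
]

for row_id, objects, concat_obj in result:
    print(row_id, concat_obj)
```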

You could use the following regex:

`(?<=[A-Za-z]) `

The only difference from your code is that this pattern matches a space only when it is preceded by an alphabetical character.
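To illustrate the lookbehind with plain Python's `re` on a joined string: the separator spaces are preserved because each follows a comma, not a letter, while the spaces inside the element names are replaced.

```python
import re

# Array joined into one string, as in the question's failed attempt
joined = "sun, solar system, mars, milky way"

# '(?<=[A-Za-z]) ' matches a space only when the preceding
# character is a letter, so ", " separators are left alone.
fixed = re.sub(r"(?<=[A-Za-z]) ", "_", joined)
print(fixed)
```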


3 Comments

I got the following error: ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 246, Column 1: Assignment conversion not possible from type "org.apache.spark.sql.catalyst.util.ArrayData" to type "org.apache.spark.unsafe.types.UTF8String"
If you can provide a debugging environment with your code, I may be able to help further; I have no way of playing with PySpark at the moment. @red_quark
At the moment I've solved the problem a different way, by converting the array to a string and applying regexp_replace. But for the future, I'm still interested in how to get the desired result without first converting the array to a string.
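The workaround described in the comment above (join the array into a string, replace, split back) can be sketched element-wise in plain Python; in PySpark the analogous steps would use `concat_ws`, `regexp_replace`, and `split`. The `|` delimiter here is an assumption, chosen because it cannot occur inside the sample elements:

```python
import re

objects = ["moon", "cosmic rays", "orion nebula"]

# Join with a delimiter that never appears inside an element,
# replace every space, then split back into a list.
joined = "|".join(objects)
restored = re.sub(" ", "_", joined).split("|")
print(restored)
```

Note this only works if the delimiter is guaranteed absent from the data, which is why the higher-order `transform` approach is safer.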
