I have a Spark dataframe:
| id | objects |
|---|---|
| 1 | [sun, solar system, mars, milky way] |
| 2 | [moon, cosmic rays, orion nebula] |
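For reproducibility, this is how the DataFrame can be built (assuming `objects` is an `array<string>` column; the sample data is taken from the table above):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data; `objects` is an array<string> column
df = spark.createDataFrame(
    [
        (1, ['sun', 'solar system', 'mars', 'milky way']),
        (2, ['moon', 'cosmic rays', 'orion nebula']),
    ],
    ['id', 'objects'],
)
```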
I need to replace spaces with underscores inside the array elements.
Expected result:
| id | objects | concat_obj |
|---|---|---|
| 1 | [sun, solar system, mars, milky way] | [sun, solar_system, mars, milky_way] |
| 2 | [moon, cosmic rays, orion nebula] | [moon, cosmic_rays, orion_nebula] |
I tried using `regexp_replace`:

```python
from pyspark.sql.functions import regexp_replace

df = df.withColumn('concat_obj', regexp_replace('objects', ' ', '_'))
```
but that replaced all spaces in the column value as a whole, whereas I need the replacement to happen only inside each array element.
So, how can this be done in PySpark?
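For context, one thing I considered was applying `regexp_replace` to every element through the higher-order `transform` function, roughly as sketched below, but I'm not sure this is the idiomatic way (this assumes Spark 3.1+, where `pyspark.sql.functions.transform` is available as a Python function; on Spark 2.4+ the same idea could presumably be written with `expr("transform(...)")`):

```python
from pyspark.sql import functions as F

# Apply regexp_replace to each array element individually.
# F.transform is available from Spark 3.1; on Spark 2.4+ the SQL form
# F.expr("transform(objects, x -> regexp_replace(x, ' ', '_'))")
# should be equivalent.
df = df.withColumn(
    'concat_obj',
    F.transform('objects', lambda x: F.regexp_replace(x, ' ', '_'))
)
```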