PySpark: create new column by splitting text

Question

I have a PySpark DataFrame like this:

df = spark.createDataFrame(
    [
        (1, '1234ESPNnonzodiac'),
        (2, '1234ESPNzodiac'),
        (3, '963CNNnonzodiac'),
        (4, '963CNNzodiac'),
    ],
    ['id', 'col1']
)

I would like to create a new column containing the part of col1 that precedes the word zodiac or nonzodiac, so that I can eventually group by this new column.

I would like the final output to be like this:

spark.createDataFrame(
    [
        (1, '1234ESPNnonzodiac', '1234ESPN'), 
        (2, '1234ESPNzodiac', '1234ESPN'),
        (3, '963CNNnonzodiac', '963CNN'), 
        (4, '963CNNzodiac', '963CNN'),
    ],
    ['id', 'col1', 'col2'] 
)

Czaporka · Accepted Answer · 2020-11-02 19:49:39Z

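I would use regexp_extract from pyspark.sql.functions. The lazy group ([\s\S]+?) captures as little as possible before an optional "non" followed by "zodiac", so the "non" prefix is not included in the result:

from pyspark.sql.functions import regexp_extract

# Group 1 is everything before "zodiac" or "nonzodiac"
df.withColumn("col2", regexp_extract(df.col1, r"([\s\S]+?)(?:non)?zodiac", 1)).show()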
+---+-----------------+--------+
| id|             col1|    col2|
+---+-----------------+--------+
|  1|1234ESPNnonzodiac|1234ESPN|
|  2|   1234ESPNzodiac|1234ESPN|
|  3|  963CNNnonzodiac|  963CNN|
|  4|     963CNNzodiac|  963CNN|
+---+-----------------+--------+
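
To see why the lazy quantifier in that pattern matters, here is a minimal sketch that runs the same regex outside Spark. It uses Python's re module purely for illustration (Spark itself evaluates the pattern with Java's regex engine, which should behave the same way for this expression). The helper extract_prefix and the greedy variant are hypothetical names introduced here for comparison; they are not part of the answer.

```python
import re

# Pattern from the answer: lazy capture, then optional "non", then "zodiac".
LAZY = re.compile(r"([\s\S]+?)(?:non)?zodiac")
# Hypothetical greedy variant, shown only to contrast the behaviour.
GREEDY = re.compile(r"([\s\S]+)(?:non)?zodiac")


def extract_prefix(pattern, text):
    """Return capture group 1, or "" when the pattern does not match.

    Returning "" on no match mirrors what regexp_extract does in Spark.
    """
    match = pattern.search(text)
    return match.group(1) if match else ""


samples = [
    "1234ESPNnonzodiac",
    "1234ESPNzodiac",
    "963CNNnonzodiac",
    "963CNNzodiac",
]

lazy_results = [extract_prefix(LAZY, s) for s in samples]
greedy_results = [extract_prefix(GREEDY, s) for s in samples]

print(lazy_results)    # ['1234ESPN', '1234ESPN', '963CNN', '963CNN']
print(greedy_results)  # ['1234ESPNnon', '1234ESPN', '963CNNnon', '963CNN']
```

With the greedy variant, the capture group swallows the "non" and the optional (?:non)? matches nothing, so the two spellings would end up in different groups. The lazy form keeps them together, which is what a later groupBy on col2 needs.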
