How to split a DataFrame column value and take only the first two strings into a new column

Question

I have a column in a DataFrame that holds string values like:

"Hardware part not present"
"Software part not present"
null
null

I want to split on " " and put only the first 2 strings into a new column. If the value is null, the new column should be null as well. How can I achieve this?

result needed

column                               New column
Hardware part not present           Hardware part
Software part not present           Software part
null                                null
null                                null

How can I achieve this using PySpark or Python?

How many columns do you need to rename in your application? If fewer than 5, I don't think the added complexity is worth it when you can simply rename with df.rename(columns....)
– Jason Chia, Commented Sep 30, 2022 at 13:16
You can use the split method for regular strings and a simple condition for null values.
– César Debeunne, Commented Sep 30, 2022 at 13:17
How do I split after the first two spaces and take the 0th index value?
– Harshith K R, Commented Sep 30, 2022 at 13:22

ZygD · Accepted Answer · 2022-09-30 15:17:43Z

2

You can use the substring_index function.

import pyspark.sql.functions as F

......
df = df.withColumn('New column', F.substring_index('column', ' ', 2))
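
To make the behaviour of that call easier to check without a Spark session, here is a minimal plain-Python sketch of what substring_index does for a positive count. The helper name substring_index_model is hypothetical and is not part of PySpark; it only models the "keep everything before the count-th delimiter" rule, and it assumes null inputs stay null.

```python
from typing import Optional


def substring_index_model(value: Optional[str], delim: str, count: int) -> Optional[str]:
    """Plain-Python model of substring_index for a positive count (hypothetical helper)."""
    if value is None:
        # A null input stays null, which is what the question asks for.
        return None
    # Keep everything before the count-th occurrence of the delimiter.
    # If there are fewer delimiters than count, the whole string is kept.
    return delim.join(value.split(delim)[:count])


rows = ["Hardware part not present", "Software part not present", None, None]
new_column = [substring_index_model(v, " ", 2) for v in rows]
print(new_column)
```

Because the null check happens inside the helper, no separate condition is needed around the call, which mirrors why the one-line withColumn above needs no when().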

edited Sep 30, 2022 at 15:17

ZygD

answered Sep 30, 2022 at 14:48

过过招

1 Comment

ZygD Over a year ago

Great answer! Straight to the point!

chromebookdev · Accepted Answer · 2022-09-30 13:31:07Z

0

Pandas has a built-in split method. You can set the maximum number of splits to limit how far into the string it goes.

df["existingcol"].str.split(n=2, expand=True)

This will give you 3 columns. Then just concatenate the first 2 and drop any unnecessary columns.
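
The same split limit can be illustrated with Python's built-in str.split, whose maxsplit argument plays the role of n here. This is a small sketch on a single string rather than a pandas Series; the function name first_two_words is an assumption made for the example.

```python
from typing import Optional


def first_two_words(value: Optional[str]) -> Optional[str]:
    """Join the first two whitespace-separated pieces (hypothetical helper)."""
    if value is None:
        # Keep nulls as nulls instead of failing on None.split().
        return None
    # At most two splits are performed, so there are at most three pieces:
    # the first word, the second word, and the untouched remainder.
    pieces = value.split(maxsplit=2)
    return " ".join(pieces[:2])


pieces = "Hardware part not present".split(maxsplit=2)
print(pieces)
print(first_two_words("Hardware part not present"))
```

Limiting the split keeps the remainder in one piece, so only that last piece has to be dropped.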

Docs for reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html

It defaults to splitting on white space, but if you think there’ll be a comma or something in there, you can always split on a regex pattern.

answered Sep 30, 2022 at 13:31

chromebookdev

1 Comment

ZygD Over a year ago

The question is for PySpark, not for Pandas.

samkart · Accepted Answer · 2022-09-30 14:13:01Z

0

In PySpark, you can achieve this using the concat_ws, slice and split functions.

import pyspark.sql.functions as func

data_sdf. \
    withColumn('text_frst2', 
               func.when(func.col('text').isNotNull(), 
                         func.concat_ws(' ', func.slice(func.split('text', ' '), 1, 2))
                         )
               ). \
    show(truncate=False)

# +----------------------------+-------------+
# |text                        |text_frst2   |
# +----------------------------+-------------+
# |software part is not present|software part|
# |hardware part is not present|hardware part|
# |null                        |null         |
# |foo bar baz                 |foo bar      |
# +----------------------------+-------------+

split splits the text on the provided delimiter (in this case " ").
slice retains N elements starting from the Kth position (in this case N=2 and K=1).
concat_ws concatenates the array elements, separated by the provided delimiter (in this case " ").
I wrapped the expression in when() so that it only runs on non-null values, because without it a null input produces a blank value instead of null.
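
The three steps and the null guard described above can be traced in plain Python. This is only a sketch of the logic, not Spark code: the function name first_tokens and its parameters are hypothetical, and the 1-based start is chosen to match the slice call in the answer.

```python
from typing import Optional


def first_tokens(text: Optional[str], start: int = 1, length: int = 2,
                 delim: str = " ") -> Optional[str]:
    """Model of split -> slice -> concat_ws with a null guard (hypothetical helper)."""
    if text is None:
        # Plays the role of when(col.isNotNull(), ...): nulls pass through as null.
        return None
    tokens = text.split(delim)                      # split: string to list of tokens
    kept = tokens[start - 1:start - 1 + length]     # slice: 1-based start, fixed length
    return delim.join(kept)                         # concat_ws: rejoin with delimiter


samples = ["software part is not present", "hardware part is not present", None, "foo bar baz"]
result = [first_tokens(s) for s in samples]
print(result)

# Joining an empty list gives an empty string, which is the blank value
# the guard is there to avoid.
print(repr(" ".join([])))
```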

edited Sep 30, 2022 at 14:13

answered Sep 30, 2022 at 13:43

samkart

4 Comments

Harshith K R Over a year ago

Hi Samkart, I was using a similar command: new_df=Flag_df.withColumn('error_part', func.when(func.col('CertificationVariant_errors').isNotNull(), func.concat_ws(' ', func.slice(func.split('CertificationVariant_errors', ' '), 1,2))))

Harshith K R Over a year ago

It's saying "NameError: name 'func' is not defined"

Harshith K R Over a year ago

Got it, I imported functions as func and it's working. Thanks for your help.

samkart Over a year ago

@HarshithKR - yes, I added that as well
