
I have a DataFrame column that holds string values like:

"Hardware part not present"
"Software part not present"
null
null

I want to split each value on " " and put only the first two words into a new column. If the value is null, the new column should be null as well.

Desired result:

column                               New column
Hardware part not present           Hardware part
Software part not present           Software part
null                                null
null                                null

How can I achieve this using PySpark or Python?

  • How many columns do you need to rename in your application? If <5, I don't think the added complexity is worth it when you can simply rename with df.rename(columns....) Commented Sep 30, 2022 at 13:16
  • You can use the split method for regular strings and a simple condition for null values. Commented Sep 30, 2022 at 13:17
  • How do I split after the first two spaces and take the value at index 0? Commented Sep 30, 2022 at 13:22
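
The suggestion in the comments (split for regular strings plus a simple condition for null values) could look roughly like this in plain Python. This is only a sketch: the function name first_two_words is made up for illustration, and it assumes nulls arrive as None.

```python
from typing import Optional


def first_two_words(text: Optional[str]) -> Optional[str]:
    """Return the first two space-separated words, or None for a null input."""
    # Null stays null, as the question asks.
    if text is None:
        return None
    # Split on a single space and keep at most the first two pieces.
    return " ".join(text.split(" ")[:2])


values = ["Hardware part not present", "Software part not present", None, None]
new_values = [first_two_words(v) for v in values]
```

A string with fewer than two words is returned unchanged, since slicing past the end of the list is harmless.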

3 Answers


You can use the substring_index function.

import pyspark.sql.functions as F

......
df = df.withColumn('New column', F.substring_index('column', ' ', 2))

1 Comment

Great answer! Straight to the point!

Pandas has a built-in split method. Its n argument caps the number of splits, which limits how far into the string it goes.

df["existingcol"].str.split(n=2, expand=True)

This will give you up to 3 columns: the first word, the second word, and the rest of the string. Then just concat the first 2 and drop any unnecessary columns.

Docs for reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html

It defaults to splitting on white space, but if you think there’ll be a comma or something in there, you can always split on a regex pattern.
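
A fuller sketch of the steps above, in case it helps. It assumes a DataFrame whose column is literally named "column" and holds the sample values from the question; the names parts and "New column" are just for illustration. The two pieces are joined with Series.str.cat, which leaves a row missing if either piece is missing.

```python
import pandas as pd

# Sample data from the question; None stands in for null.
df = pd.DataFrame(
    {"column": ["Hardware part not present", "Software part not present", None, None]}
)

# Split on whitespace at most twice: columns 0 and 1 hold the first two words,
# column 2 holds the remainder of the string.
parts = df["column"].str.split(n=2, expand=True)

# Join the first two pieces with a space; missing inputs stay missing.
df["New column"] = parts[0].str.cat(parts[1], sep=" ")
```

One thing to watch: a value with only one word has no second piece, so with this approach its "New column" entry also comes out missing.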

1 Comment

The question is for PySpark, not for Pandas.

In PySpark, you can achieve this using the concat_ws, slice and split functions.

import pyspark.sql.functions as func

data_sdf. \
    withColumn('text_frst2', 
               func.when(func.col('text').isNotNull(), 
                         func.concat_ws(' ', func.slice(func.split('text', ' '), 1, 2))
                         )
               ). \
    show(truncate=False)

# +----------------------------+-------------+
# |text                        |text_frst2   |
# +----------------------------+-------------+
# |software part is not present|software part|
# |hardware part is not present|hardware part|
# |null                        |null         |
# |foo bar baz                 |foo bar      |
# +----------------------------+-------------+
  • split splits the text into an array on the provided delimiter (in this case " ")
  • slice retains N elements starting from the Kth position (in this case N=2 and K=1)
  • concat_ws concatenates the array elements, separated by the provided delimiter (in this case " ")
  • I wrapped the expression in when() so it only applies to non-null values; without it, concat_ws returns a blank (empty string) value for null instead of null

4 Comments

Hi Samkart, I was using a similar command: new_df=Flag_df.withColumn('error_part', func.when(func.col('CertificationVariant_errors').isNotNull(), func.concat_ws(' ', func.slice(func.split('CertificationVariant_errors', ' '), 1,2))))
It's saying "NameError: name 'func' is not defined"
Got it, I imported functions as func and it's working. Thanks for your help.
@HarshithKR - yes, I added that as well
