
I want to create a new column from a string column by splitting on the separator " ", but skipping the split when a digit follows a non-digit token (so "NA 611" stays together), and finally deleting the ";" at the end if it exists, using Python/PySpark:

Inputs :

"511 520 NA 611;"
"322 GA 620"  
"3 321;"
"334344"

Expected output:

+Column            | +new column
"511 520 NA 611;"  | [511, 520, NA 611]
"322 GA 620"       | [322, GA 620]
"3 321;"           | [3, 321]
"334 344"          | [334, 344]

I tried:

data = data.withColumn(
    "newcolumn",
    split(col("column"), "\\s"))

but the ";" stays attached to the last element of the array, as shown here, and I want to remove it if it exists:

+Column            | +new column
"511 520 NA 611;"  | [511, 520, NA, 611;]
"322 GA 620"       | [322, GA, 620]
"3 321;"           | [3, 321;]
"334 344"          | [334, 344]

2 Answers


You can use regexp_replace to remove the ";" at the end of the string first, and then apply split. The regular expression ";$" matches a ";" at the end of the string.

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col, regexp_replace

spark = SparkSession.builder.getOrCreate()

data = [
    ("511 520 NA 611;",),
    ("322 GA 620",),
    ("3 321;",),
    ("334 344",)
]

df = spark.createDataFrame(data, ['column'])
df = df.withColumn("newcolumn", split(regexp_replace(col("column"), ';$', ''), "\\s"))
df.show(truncate=False)

2 Comments

This works to deal with the ";" at the end, but it still doesn't skip the split when a digit follows, as I explained in the expected output.
I can only think of a clumsier way: first use the regexp_extract_all function to extract all the special strings, then split the rest of the string, and finally concatenate the extracted strings with the split result. But that cannot guarantee the order.
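(A possible alternative not given in the answers, sketched here as an assumption: splitting only on a space that is *preceded* by a digit, via the lookbehind pattern `(?<=\d)\s`, appears to reproduce the expected grouping. The logic is checked below with Python's re module; the same pattern should carry over to PySpark as `split(regexp_replace(col("column"), ';$', ''), '(?<=\\d)\\s')`, since Spark uses Java regexes, which support lookbehind.)

```python
import re

def tokenize(s):
    # Drop a trailing ";" if present, then split only on a space
    # that directly follows a digit (lookbehind), so "NA 611"
    # and "GA 620" are kept together.
    return re.split(r"(?<=\d)\s", re.sub(r";$", "", s))

for s in ["511 520 NA 611;", "322 GA 620", "3 321;", "334 344"]:
    print(tokenize(s))
# ['511', '520', 'NA 611']
# ['322', 'GA 620']
# ['3', '321']
# ['334', '344']
```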

As mentioned in the comments, you can use regexp_extract_all together with the right regexp, as shown below:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
data = [
  ["511 520 NA 611;"],
  ["322 GA 620"],
  ["3 321;"],
  ["334344"]
]

df = spark.createDataFrame(data, ["value"]) 

df.withColumn("extracted_value", F.expr(r"regexp_extract_all(value, '(\\d+)|(\\w+\\s\\d+)', 0)")).show()

# +---------------+------------------+
# |          value|   extracted_value|
# +---------------+------------------+
# |511 520 NA 611;|[511, 520, NA 611]|
# |     322 GA 620|     [322, GA 620]|
# |         3 321;|          [3, 321]|
# |         334344|          [334344]|
# +---------------+------------------+
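(The Java regex passed to regexp_extract_all can be sanity-checked locally with Python's re module; this is an illustrative check of the pattern's logic, not the Spark call itself. A match is either a bare digit run, or a word followed by whitespace and a digit run, tried in that order.)

```python
import re

PATTERN = re.compile(r"\d+|\w+\s\d+")

def extract(s):
    # Mirrors regexp_extract_all(value, '(\d+)|(\w+\s\d+)', 0):
    # collect every non-overlapping match, left to right.
    return [m.group(0) for m in PATTERN.finditer(s)]

for s in ["511 520 NA 611;", "322 GA 620", "3 321;", "334344"]:
    print(extract(s))
# ['511', '520', 'NA 611']
# ['322', 'GA 620']
# ['3', '321']
# ['334344']
```

Note that the trailing ";" never needs explicit removal here: it simply matches neither alternative, so it is left out of the extracted list.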

