
I want to create a new column from a string column by splitting on the separator " ", but skipping the split when a digit follows a non-digit token (so "NA 611" stays together), and finally deleting the ";" at the end if it exists, using Python/PySpark:

Inputs :

"511 520 NA 611;"
"322 GA 620"  
"3 321;"
"334344"

Expected output:

+Column            | +new column
"511 520 NA 611;"  | [511, 520, NA 611]
"322 GA 620"       | [322, GA 620]
"3 321;"           | [3, 321]
"334 344"          | [334, 344]

I tried:

data = data.withColumn(
    "newcolumn",
    split(col("column"), "\\s"))

but the ";" stays attached to the last element of the array, as shown here, and I want to remove it if it exists:

+Column            | +new column
"511 520 NA 611;"  | [511, 520, NA, 611;]
"322 GA 620"       | [322, GA, 620]
"3 321;"           | [3, 321;]
"334 344"          | [334, 344]

2 Answers


You can use regexp_replace to remove the ";" at the end of the string first, and then apply split. The regular expression ";$" matches a ";" at the end of the string.

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col, regexp_replace

spark = SparkSession.builder.getOrCreate()

data = [
    ("511 520 NA 611;",),
    ("322 GA 620",),
    ("3 321;",),
    ("334 344",)
]

df = spark.createDataFrame(data, ['column'])
df = df.withColumn("newcolumn", split(regexp_replace(col("column"), ';$', ''), "\\s"))
df.show(truncate=False)

2 Comments

This works to deal with the ";" at the end, but it still doesn't skip the split when a digit follows, as I explained in the expected output.
I can only think of a clumsier way: first use the regexp_extract_all function to extract all the special strings, then split the rest of the string, and finally concatenate the extracted strings with the split result. But that cannot guarantee the order.
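(A possible alternative not given in the answers, sketched here as an assumption: splitting only on a space that is *preceded* by a digit, via the lookbehind pattern `(?<=\d)\s`, appears to reproduce the expected grouping. The logic is checked below with Python's re module; the same pattern should carry over to PySpark as `split(regexp_replace(col("column"), ';$', ''), '(?<=\\d)\\s')`, since Spark uses Java regexes, which support lookbehind.)

```python
import re

def tokenize(s):
    # Drop a trailing ";" if present, then split only on a space
    # that directly follows a digit (lookbehind), so "NA 611"
    # and "GA 620" are kept together.
    return re.split(r"(?<=\d)\s", re.sub(r";$", "", s))

for s in ["511 520 NA 611;", "322 GA 620", "3 321;", "334 344"]:
    print(tokenize(s))
# ['511', '520', 'NA 611']
# ['322', 'GA 620']
# ['3', '321']
# ['334', '344']
```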

As mentioned in the comments, you can use regexp_extract_all together with the right regexp, as shown below:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
data = [
  ["511 520 NA 611;"],
  ["322 GA 620"],
  ["3 321;"],
  ["334344"]
]

df = spark.createDataFrame(data, ["value"]) 

df.withColumn("extracted_value", F.expr(r"regexp_extract_all(value, '(\\d+)|(\\w+\\s\\d+)', 0)")).show()

# +---------------+------------------+
# |          value|   extracted_value|
# +---------------+------------------+
# |511 520 NA 611;|[511, 520, NA 611]|
# |     322 GA 620|     [322, GA 620]|
# |         3 321;|          [3, 321]|
# |         334344|          [334344]|
# +---------------+------------------+
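(The Java regex passed to regexp_extract_all can be sanity-checked locally with Python's re module; this is an illustrative check of the pattern's logic, not the Spark call itself. A match is either a bare digit run, or a word followed by whitespace and a digit run, tried in that order.)

```python
import re

PATTERN = re.compile(r"\d+|\w+\s\d+")

def extract(s):
    # Mirrors regexp_extract_all(value, '(\d+)|(\w+\s\d+)', 0):
    # collect every non-overlapping match, left to right.
    return [m.group(0) for m in PATTERN.finditer(s)]

for s in ["511 520 NA 611;", "322 GA 620", "3 321;", "334344"]:
    print(extract(s))
# ['511', '520', 'NA 611']
# ['322', 'GA 620']
# ['3', '321']
# ['334344']
```

Note that the trailing ";" never needs explicit removal here: it simply matches neither alternative, so it is left out of the extracted list.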

