
The column contains multiple occurrences of the delimiter in a single row, so splitting it is not straightforward.
When splitting, only the first occurrence of the delimiter should be considered in this case.

As of now, I am doing the following.

However, I feel there could be a better solution.

from pyspark.sql.functions import col, expr, split

testdf = spark.createDataFrame([("Dog", "meat,bread,milk"), ("Cat", "mouse,fish")], ["Animal", "Food"])

testdf.show()

+------+---------------+
|Animal|           Food|
+------+---------------+
|   Dog|meat,bread,milk|
|   Cat|     mouse,fish|
+------+---------------+

testdf.withColumn("Food1", split(col("Food"), ",").getItem(0))\
        .withColumn("Food2",expr("regexp_replace(Food, Food1, '')"))\
        .withColumn("Food2",expr("substring(Food2, 2)")).show()

+------+---------------+-----+----------+
|Animal|           Food|Food1|     Food2|
+------+---------------+-----+----------+
|   Dog|meat,bread,milk| meat|bread,milk|
|   Cat|     mouse,fish|mouse|      fish|
+------+---------------+-----+----------+

3 Answers


I would just use string functions; I don't see a reason to use regex.

from pyspark.sql import functions as F

testdf\
      .withColumn("Food1", F.expr("""substring(Food,1,instr(Food,',')-1)"""))\
      .withColumn("Food2", F.expr("""substring(Food,instr(Food,',')+1,length(Food))""")).show()

#+------+---------------+-----+----------+
#|Animal|           Food|Food1|     Food2|
#+------+---------------+-----+----------+
#|   Dog|meat,bread,milk| meat|bread,milk|
#|   Cat|     mouse,fish|mouse|      fish|
#+------+---------------+-----+----------+

1 Comment

This works perfectly fine. But is it possible to return null if the delimiter is not present in the string?
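
One possible way to get that behaviour (a sketch of mine, not part of the answer) is to guard both expressions with instr, so that Food2 becomes null when no comma is present; the ELSE branches can be adjusted to whatever is wanted for Food1:

from pyspark.sql import functions as F

# Sketch: only split when a comma is actually present; otherwise keep the
# whole string in Food1 and return null for Food2.
testdf\
      .withColumn("Food1", F.expr("""CASE WHEN instr(Food, ',') > 0
                                          THEN substring(Food, 1, instr(Food, ',') - 1)
                                          ELSE Food END"""))\
      .withColumn("Food2", F.expr("""CASE WHEN instr(Food, ',') > 0
                                          THEN substring(Food, instr(Food, ',') + 1, length(Food))
                                          ELSE NULL END"""))\
      .show()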

An approach using a regular expression to split only on the first occurrence of the delimiter:

from pyspark.sql import functions as f

testdf.withColumn('Food1', f.split('Food', "(?<=^[^,]*)\\,")[0])\
      .withColumn('Food2', f.split('Food', "(?<=^[^,]*)\\,")[1]).show()

+------+---------------+-----+----------+
|Animal|           Food|Food1|     Food2|
+------+---------------+-----+----------+
|   Dog|meat,bread,milk| meat|bread,milk|
|   Cat|     mouse,fish|mouse|      fish|
+------+---------------+-----+----------+

2 Comments

I have now actually accepted this answer because it handles the corner case where there isn't even one comma: (Cow, grass) would be split into (Cow, grass, grass, null) in our example (see the sketch after these comments).
@Shubham - would you mind providing an explanation for the regex (in case someone has a different delimiter)?
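
To illustrate that corner case, and the pattern itself, here is a small sketch of mine: the lookbehind (?<=^[^,]*) only lets a comma match when nothing between the start of the string and that comma is itself a comma, i.e. only the first one; for a different delimiter, replace both commas in the pattern (e.g. ';' gives "(?<=^[^;]*);"). testdf2 is just the question's data plus a hypothetical no-comma row.

from pyspark.sql import functions as f

# Sketch: the question's data plus a row without any comma.
testdf2 = spark.createDataFrame(
    [("Dog", "meat,bread,milk"), ("Cat", "mouse,fish"), ("Cow", "grass")],
    ["Animal", "Food"])

# Index [1] is out of bounds for the "Cow" row, so Spark returns null there.
testdf2.withColumn('Food1', f.split('Food', "(?<=^[^,]*)\\,")[0])\
       .withColumn('Food2', f.split('Food', "(?<=^[^,]*)\\,")[1]).show()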

A slightly different approach is to use slice and trim:

from pyspark.sql.functions import expr, split

testdf.withColumn("food_ar", split("Food", ",")) \
      .select(
          testdf.Animal,
          testdf.Food,
          expr("food_ar[0]").alias("Food1"),
          expr("trim('[]', string(slice(food_ar, 2, size(food_ar) - 1)))").alias("Food2")) \
      .show()

# +------+---------------+-----+----------+
# |Animal|           Food|Food1|     Food2|
# +------+---------------+-----+----------+
# |   Dog|meat,bread,milk| meat|bread,milk|
# |   Cat|     mouse,fish|mouse|      fish|
# +------+---------------+-----+----------+

First, use split as you already did to generate the array. Then access the head of the array with the Spark SQL accessor food_ar[0], and build the tail with slice, casting the sliced array to a string and trimming off the surrounding brackets.
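
A possible variant of the same idea (my sketch, assuming Spark 2.4+ where array_join is available) joins the sliced tail directly instead of casting the array to a string and trimming the brackets:

from pyspark.sql.functions import expr, split

# Sketch: head via food_ar[0], tail via array_join over the sliced array.
testdf.withColumn("food_ar", split("Food", ","))\
      .select(
          "Animal",
          "Food",
          expr("food_ar[0]").alias("Food1"),
          expr("array_join(slice(food_ar, 2, size(food_ar) - 1), ',')").alias("Food2"))\
      .show()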

