
The column contains multiple occurrences of the delimiter in a single row, so splitting it is not straightforward.
When splitting, only the first occurrence of the delimiter should be considered in this case.

As of now, I am doing the following.

However, I feel there could be a better solution.

from pyspark.sql.functions import col, expr, split

testdf = spark.createDataFrame([("Dog", "meat,bread,milk"), ("Cat", "mouse,fish")], ["Animal", "Food"])

testdf.show()

+------+---------------+
|Animal|           Food|
+------+---------------+
|   Dog|meat,bread,milk|
|   Cat|     mouse,fish|
+------+---------------+

testdf.withColumn("Food1", split(col("Food"), ",").getItem(0))\
        .withColumn("Food2",expr("regexp_replace(Food, Food1, '')"))\
        .withColumn("Food2",expr("substring(Food2, 2)")).show()

+------+---------------+-----+----------+
|Animal|           Food|Food1|     Food2|
+------+---------------+-----+----------+
|   Dog|meat,bread,milk| meat|bread,milk|
|   Cat|     mouse,fish|mouse|      fish|
+------+---------------+-----+----------+

3 Answers


I would just use string functions; I don't see a reason to use regex.

from pyspark.sql import functions as F

testdf\
      .withColumn("Food1", F.expr("""substring(Food,1,instr(Food,',')-1)"""))\
      .withColumn("Food2", F.expr("""substring(Food,instr(Food,',')+1,length(Food))""")).show()

#+------+---------------+-----+----------+
#|Animal|           Food|Food1|     Food2|
#+------+---------------+-----+----------+
#|   Dog|meat,bread,milk| meat|bread,milk|
#|   Cat|     mouse,fish|mouse|      fish|
#+------+---------------+-----+----------+

1 Comment

This works perfectly fine. But is it possible to return null if the delimiter is not present in the string?
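
One possible way to get that behaviour (a sketch of mine, not part of the answer) is to guard both expressions with instr, so that Food2 becomes null when no comma is present; the ELSE branches can be adjusted to whatever is wanted for Food1:

from pyspark.sql import functions as F

# Sketch: only split when a comma is actually present; otherwise keep the
# whole string in Food1 and return null for Food2.
testdf\
      .withColumn("Food1", F.expr("""CASE WHEN instr(Food, ',') > 0
                                          THEN substring(Food, 1, instr(Food, ',') - 1)
                                          ELSE Food END"""))\
      .withColumn("Food2", F.expr("""CASE WHEN instr(Food, ',') > 0
                                          THEN substring(Food, instr(Food, ',') + 1, length(Food))
                                          ELSE NULL END"""))\
      .show()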

An approach using a regular expression to split only on the first occurrence of the delimiter:

from pyspark.sql import functions as f

testdf.withColumn('Food1', f.split('Food', "(?<=^[^,]*)\\,")[0])\
      .withColumn('Food2', f.split('Food', "(?<=^[^,]*)\\,")[1]).show()

+------+---------------+-----+----------+
|Animal|           Food|Food1|     Food2|
+------+---------------+-----+----------+
|   Dog|meat,bread,milk| meat|bread,milk|
|   Cat|     mouse,fish|mouse|      fish|
+------+---------------+-----+----------+

2 Comments

I have now actually accepted this answer because it handles the corner case where there isn't even one comma: (Cow, grass) would be split into (Cow, grass, grass, null) in our example (see the sketch after these comments).
@Shubham - would you mind providing an explanation for the regex (in case someone has a different delimiter)?
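
To illustrate that corner case, and the pattern itself, here is a small sketch of mine: the lookbehind (?<=^[^,]*) only lets a comma match when nothing between the start of the string and that comma is itself a comma, i.e. only the first one; for a different delimiter, replace both commas in the pattern (e.g. ';' gives "(?<=^[^;]*);"). testdf2 is just the question's data plus a hypothetical no-comma row.

from pyspark.sql import functions as f

# Sketch: the question's data plus a row without any comma.
testdf2 = spark.createDataFrame(
    [("Dog", "meat,bread,milk"), ("Cat", "mouse,fish"), ("Cow", "grass")],
    ["Animal", "Food"])

# Index [1] is out of bounds for the "Cow" row, so Spark returns null there.
testdf2.withColumn('Food1', f.split('Food', "(?<=^[^,]*)\\,")[0])\
       .withColumn('Food2', f.split('Food', "(?<=^[^,]*)\\,")[1]).show()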

A slightly different approach is to use slice and trim:

from pyspark.sql.functions import expr, split

testdf.withColumn("food_ar", split("Food", ",")) \
      .select(
          testdf.Animal,
          testdf.Food,
          expr("food_ar[0]").alias("Food1"),
          expr("trim('[]', string(slice(food_ar, 2, size(food_ar) - 1)))").alias("Food2")) \
      .show()

# +------+---------------+-----+----------+
# |Animal|           Food|Food1|     Food2|
# +------+---------------+-----+----------+
# |   Dog|meat,bread,milk| meat|bread,milk|
# |   Cat|     mouse,fish|mouse|      fish|
# +------+---------------+-----+----------+

First, use split as you already did to generate the array. Then access the head of the array with the Spark SQL accessor food_ar[0], and build the tail with slice, casting the sliced array to a string and trimming off the surrounding brackets.
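
A possible variant of the same idea (my sketch, assuming Spark 2.4+ where array_join is available) joins the sliced tail directly instead of casting the array to a string and trimming the brackets:

from pyspark.sql.functions import expr, split

# Sketch: head via food_ar[0], tail via array_join over the sliced array.
testdf.withColumn("food_ar", split("Food", ","))\
      .select(
          "Animal",
          "Food",
          expr("food_ar[0]").alias("Food1"),
          expr("array_join(slice(food_ar, 2, size(food_ar) - 1), ',')").alias("Food2"))\
      .show()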

