
I have the below array column in a DataFrame:

+-------------------------------------------------+
|typed_phone_numbers                              |
+-------------------------------------------------+
|[-5594162570~222222-PHONE~FAX-17-TEST]           |
|[-2812597115~1111111-PHONE~FAX-17-TESTB]         |
+-------------------------------------------------+

I want to create another element within the array when both PHONE and FAX are present in the first element. If there is only a phone or only a fax, there is no need to create another element.

EXPECTED OUTPUT

+-------------------------------------------------------+
|typed_phone_numbers                                    |
+-------------------------------------------------------+
|["-5594162570-PHONE-17-TEST","-222222-FAX-17-TEST"]    |
|["-2812597115-PHONE-17-TESTB","-1111111-FAX-17-TESTB"] |
+-------------------------------------------------------+
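
For reference, the input above can be reproduced with something like this (assuming an active SparkSession named spark; the column is an array of strings):

data = [
    (['-5594162570~222222-PHONE~FAX-17-TEST'],),
    (['-2812597115~1111111-PHONE~FAX-17-TESTB'],),
]
df = spark.createDataFrame(data, ['typed_phone_numbers'])
df.show(truncate=False)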
  • Please provide a minimal reproducible example in the text of your question, including the code for what you've tried. Commented Apr 23, 2020 at 20:28
  • I don't have a single clue how to achieve that scenario Commented Apr 23, 2020 at 20:29

2 Answers


First split on both - and ~, remove the empty strings, check in a when clause whether PHONE and FAX are both present (using the higher-order function filter), then apply the logic with element_at, concat and concat_ws (Spark 2.4+).

#sample data
#df.show()
#+----------------------------------------+
#|typed_phone_numbers                     |
#+----------------------------------------+
#|[-5594162570~222222-PHONE~FAX-17-TEST]  |
#|[-2812597115~1111111-PHONE~FAX-17-TESTB]|
#|[-2812597115~1111111-PHONE]             |
#+----------------------------------------+  



from pyspark.sql import functions as F

# split the first array element on '-' and '~', drop the empty leading token,
# then rebuild one PHONE entry and one FAX entry when both tags are present
df.withColumn("yo", F.split(F.col("typed_phone_numbers")[0], r'\-|~'))\
  .withColumn("yo", F.expr("""filter(yo, x -> x != '')"""))\
  .withColumn("typed_phone_numbers", F.when(F.size(F.expr("""filter(yo, x -> x = 'PHONE' or x = 'FAX')""")) == 2,
                           F.array(F.concat(F.lit('-'), F.concat_ws('-', F.element_at("yo", 1),
                                                   F.element_at("yo", 3),
                                                   F.element_at("yo", 5),
                                                   F.element_at("yo", 6))),
                           F.concat(F.lit('-'), F.concat_ws('-', F.element_at("yo", 2),
                                                   F.element_at("yo", 4),
                                                   F.element_at("yo", 5),
                                                   F.element_at("yo", 6)))))
              .otherwise(F.col("typed_phone_numbers"))).drop("yo").show(truncate=False)


#+---------------------------------------------------+
#|typed_phone_numbers                                |
#+---------------------------------------------------+
#|[-5594162570-PHONE-17-TEST, -222222-FAX-17-TEST]   |
#|[-2812597115-PHONE-17-TESTB, -1111111-FAX-17-TESTB]|
#|[-2812597115~1111111-PHONE]                        |
#+---------------------------------------------------+
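
As a side note, the "both tags present" check can also be written with array_contains instead of filter plus size; this is only a sketch against the same intermediate yo column, with the rest of the logic left as above:

from pyspark.sql import functions as F

# True only when the split token array holds both tags; usable as the
# condition inside F.when(...) above.
has_both = F.array_contains("yo", "PHONE") & F.array_contains("yo", "FAX")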

UPDATE:

Use the higher-order function transform to apply the logic to each element of the array.

#sample data
#df.show()
#+------------------------------------------------------------------------------+
#|typed_phone_numbers                                                           |
#+------------------------------------------------------------------------------+
#|[-5594162570~222222-PHONE~FAX-17-TEST]                                        |
#|[-5594162570~222222-PHONE~FAX-17-TEST, -2812597115~1111111-PHONE~FAX-17-TESTB]|
#|[-2812597115~1111111-PHONE~FAX-17-TESTB]                                      |
#|[-2812597115~1111111-PHONE]                                                   |
#+------------------------------------------------------------------------------+


from pyspark.sql import functions as F
df\
  .withColumn("yo", F.expr("""(transform(typed_phone_numbers,x-> split(substring(x,2,length(x)),'\-|~')))"""))\
  .withColumn("typed_phone_numbers",F.when(F.size(F.expr("""filter(yo[0],x->x='PHONE' or x='FAX')"""))==2,\
                          F.flatten(F.expr("""transform(yo,y->\
                                                   array(concat('-',concat_ws('-',y[0],y[2],y[4],y[5])),\
                                                         concat('-',concat_ws('-',y[1],y[3],y[4],y[5]))))""")))\
                          .otherwise(F.col("typed_phone_numbers")))\
                          .drop("yo").show(truncate=False)


#+---------------------------------------------------------------------------------------------------+
#|typed_phone_numbers                                                                                |
#+---------------------------------------------------------------------------------------------------+
#|[-5594162570-PHONE-17-TEST, -222222-FAX-17-TEST]                                                   |
#|[-5594162570-PHONE-17-TEST, -222222-FAX-17-TEST, -2812597115-PHONE-17-TESTB, -1111111-FAX-17-TESTB]|
#|[-2812597115-PHONE-17-TESTB, -1111111-FAX-17-TESTB]                                                |
#|[-2812597115~1111111-PHONE]                                                                        |
#+---------------------------------------------------------------------------------------------------+

If any array row can contain a single PHONE or a single FAX (even alongside other PHONE+FAX elements), you could use this:

#+------------------------------------------------------------------------------+
#|typed_phone_numbers                                                           |
#+------------------------------------------------------------------------------+
#|[-5594162570~222222-PHONE~FAX-17-TEST, -2812597115~1111111-PHONE]             |
#|[-5594162570~222222-PHONE~FAX-17-TEST, -2812597115~1111111-PHONE~FAX-17-TESTB]|
#|[-2812597115~1111111-PHONE~FAX-17-TESTB, -2812597115~1111111-FAX]             |
#|[-2812597115~1111111-PHONE]                                                   |
#+------------------------------------------------------------------------------+

from pyspark.sql import functions as F
df\
  .withColumn("yo", F.expr("""(transform(typed_phone_numbers,x-> split(substring(x,2,length(x)),'\-|~')))"""))\
  .withColumn("typed_phone_numbers",\
                          F.flatten(F.expr("""transform(yo,y->\
                          IF((array_contains(y,'PHONE')==True) and (array_contains(y,'FAX')==True),\
                                                   array(concat('-',concat_ws('-',y[0],y[2],y[4],y[5])),\
                                                         concat('-',concat_ws('-',y[1],y[3],y[4],y[5]))),\
                                                         array(concat('-',concat_ws('-',y)))))""")))\
                          .drop("yo").show(truncate=False)


#+---------------------------------------------------------------------------------------------------+
#|typed_phone_numbers                                                                                |
#+---------------------------------------------------------------------------------------------------+
#|[-5594162570-PHONE-17-TEST, -222222-FAX-17-TEST, -2812597115-1111111-PHONE]                        |
#|[-5594162570-PHONE-17-TEST, -222222-FAX-17-TEST, -2812597115-PHONE-17-TESTB, -1111111-FAX-17-TESTB]|
#|[-2812597115-PHONE-17-TESTB, -1111111-FAX-17-TESTB, -2812597115-1111111-FAX]                       |
#|[-2812597115-1111111-PHONE]                                                                        |
#+---------------------------------------------------------------------------------------------------+

1 Comment

What if there are multiple elements in the array? This code only looks at the first element; for example, with ['-5594162570~222222-PHONE~FAX-17-TEST','-2812597115~1111111-PHONE~FAX-17-TESTB'] it would only process the 0th element.

You can use a combination of regexp_replace(str, string-pattern, replacement-pattern) and split(col, string-pattern) like so

Your data

data = [
    (['-5594162570~222222-PHONE~FAX-17-TEST'],),
    (['-2812597115~1111111-PHONE~FAX-17-TESTB'],),
    (['-5594162570-PHONE-17-TEST'],),
    (['-2812597115-FAX-17-TESTB'],)
]
df = spark.createDataFrame(data, ['typed_phone_numbers'])

Solution

from pyspark.sql.functions import col, regexp_replace, split
(
    df.
        withColumn(
            'typed_phone_numbers',
            split(
                regexp_replace(
                    regexp_replace(
                        col('typed_phone_numbers')[0],
                        '^(-\\d+)(~\\d+)(-PHONE)(~FAX)(-\\d+-\\w+)$',
                        '$1$3$5,$2$4$5'
                    ),
                    '~',
                    '-'
                ),
                ','
            )
        ).
        show(truncate=False)
)
+---------------------------------------------------+                           
|typed_phone_numbers                                |
+---------------------------------------------------+
|[-5594162570-PHONE-17-TEST, -222222-FAX-17-TEST]   |
|[-2812597115-PHONE-17-TESTB, -1111111-FAX-17-TESTB]|
|[-5594162570-PHONE-17-TEST]                        |
|[-2812597115-FAX-17-TESTB]                         |
+---------------------------------------------------+

1 Comment

What if there are multiple elements in the array? This code only looks at the first element; for example, with ['-5594162570~222222-PHONE~FAX-17-TEST','-2812597115~1111111-PHONE~FAX-17-TESTB'] it would only process the 0th element.
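
As the comment notes, the expression above only rewrites element 0 of the array. If the array can hold several such strings, one possible workaround (a sketch, not part of the original answer) is to run the same two regexp_replace calls inside the SQL higher-order function transform, split each rewritten element on the comma, and flatten the result (Spark 2.4+), applied here to the same df built above:

from pyspark.sql import functions as F

# Apply the regex rewrite to every array element, split each result on ','
# and flatten back into a single array. Elements that do not match the
# PHONE+FAX pattern pass through with only '~' replaced by '-'.
df2 = df.withColumn(
    'typed_phone_numbers',
    F.flatten(F.expr(r"""
        transform(typed_phone_numbers, x ->
            split(regexp_replace(regexp_replace(x,
                      '^(-\\d+)(~\\d+)(-PHONE)(~FAX)(-\\d+-\\w+)$',
                      '$1$3$5,$2$4$5'),
                  '~', '-'),
            ','))
    """))
)
df2.show(truncate=False)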
