
I have the below array column in a DataFrame:

+-------------------------------------------------+
|typed_phone_numbers                              |
+-------------------------------------------------+
|[-5594162570~222222-PHONE~FAX-17-TEST]           |
|[-2812597115~1111111-PHONE~FAX-17-TESTB]         |
+-------------------------------------------------+

I want to create another element within the array when both PHONE and FAX are present in the first element. If there is only a phone or only a fax, there is no need to create another element.

EXPECTED OUTPUT

+-------------------------------------------------------+
|typed_phone_numbers                                    |
+-------------------------------------------------------+
|["-5594162570-PHONE-17-TEST","-222222-FAX-17-TEST"]    |
|["-2812597115-PHONE-17-TESTB","-1111111-FAX-17-TESTB"] |
+-------------------------------------------------------+
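
For reference, the input above can be reproduced with something like this (assuming an active SparkSession named spark; the column is an array of strings):

data = [
    (['-5594162570~222222-PHONE~FAX-17-TEST'],),
    (['-2812597115~1111111-PHONE~FAX-17-TESTB'],),
]
df = spark.createDataFrame(data, ['typed_phone_numbers'])
df.show(truncate=False)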
  • Please provide a minimal reproducible example in the text of your question, including the code for what you've tried. Commented Apr 23, 2020 at 20:28
  • I don't have a single clue how to achieve that scenario Commented Apr 23, 2020 at 20:29

2 Answers


First split on both - and ~, remove the empty strings, check in a when clause whether PHONE and FAX are both present (using the higher-order function filter), then apply the logic with element_at, concat and concat_ws (Spark 2.4+).

#sample data
#df.show()
#+----------------------------------------+
#|typed_phone_numbers                     |
#+----------------------------------------+
#|[-5594162570~222222-PHONE~FAX-17-TEST]  |
#|[-2812597115~1111111-PHONE~FAX-17-TESTB]|
#|[-2812597115~1111111-PHONE]             |
#+----------------------------------------+  



from pyspark.sql import functions as F

# split the first array element on '-' and '~', drop the empty leading token,
# then rebuild one PHONE entry and one FAX entry when both tags are present
df.withColumn("yo", F.split(F.col("typed_phone_numbers")[0], r'\-|~'))\
  .withColumn("yo", F.expr("""filter(yo, x -> x != '')"""))\
  .withColumn("typed_phone_numbers", F.when(F.size(F.expr("""filter(yo, x -> x = 'PHONE' or x = 'FAX')""")) == 2,
                           F.array(F.concat(F.lit('-'), F.concat_ws('-', F.element_at("yo", 1),
                                                   F.element_at("yo", 3),
                                                   F.element_at("yo", 5),
                                                   F.element_at("yo", 6))),
                           F.concat(F.lit('-'), F.concat_ws('-', F.element_at("yo", 2),
                                                   F.element_at("yo", 4),
                                                   F.element_at("yo", 5),
                                                   F.element_at("yo", 6)))))
              .otherwise(F.col("typed_phone_numbers"))).drop("yo").show(truncate=False)


#+---------------------------------------------------+
#|typed_phone_numbers                                |
#+---------------------------------------------------+
#|[-5594162570-PHONE-17-TEST, -222222-FAX-17-TEST]   |
#|[-2812597115-PHONE-17-TESTB, -1111111-FAX-17-TESTB]|
#|[-2812597115~1111111-PHONE]                        |
#+---------------------------------------------------+
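
As a side note, the "both tags present" check can also be written with array_contains instead of filter plus size; this is only a sketch against the same intermediate yo column, with the rest of the logic left as above:

from pyspark.sql import functions as F

# True only when the split token array holds both tags; usable as the
# condition inside F.when(...) above.
has_both = F.array_contains("yo", "PHONE") & F.array_contains("yo", "FAX")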

UPDATE:

Use the higher-order function transform to apply the logic to each element of the array.

#sample data
#df.show()
#+------------------------------------------------------------------------------+
#|typed_phone_numbers                                                           |
#+------------------------------------------------------------------------------+
#|[-5594162570~222222-PHONE~FAX-17-TEST]                                        |
#|[-5594162570~222222-PHONE~FAX-17-TEST, -2812597115~1111111-PHONE~FAX-17-TESTB]|
#|[-2812597115~1111111-PHONE~FAX-17-TESTB]                                      |
#|[-2812597115~1111111-PHONE]                                                   |
#+------------------------------------------------------------------------------+


from pyspark.sql import functions as F
df\
  .withColumn("yo", F.expr("""(transform(typed_phone_numbers,x-> split(substring(x,2,length(x)),'\-|~')))"""))\
  .withColumn("typed_phone_numbers",F.when(F.size(F.expr("""filter(yo[0],x->x='PHONE' or x='FAX')"""))==2,\
                          F.flatten(F.expr("""transform(yo,y->\
                                                   array(concat('-',concat_ws('-',y[0],y[2],y[4],y[5])),\
                                                         concat('-',concat_ws('-',y[1],y[3],y[4],y[5]))))""")))\
                          .otherwise(F.col("typed_phone_numbers")))\
                          .drop("yo").show(truncate=False)


#+---------------------------------------------------------------------------------------------------+
#|typed_phone_numbers                                                                                |
#+---------------------------------------------------------------------------------------------------+
#|[-5594162570-PHONE-17-TEST, -222222-FAX-17-TEST]                                                   |
#|[-5594162570-PHONE-17-TEST, -222222-FAX-17-TEST, -2812597115-PHONE-17-TESTB, -1111111-FAX-17-TESTB]|
#|[-2812597115-PHONE-17-TESTB, -1111111-FAX-17-TESTB]                                                |
#|[-2812597115~1111111-PHONE]                                                                        |
#+---------------------------------------------------------------------------------------------------+

If any array row can contain a single PHONE or a single FAX (even alongside other PHONE+FAX elements), you could use this:

#+------------------------------------------------------------------------------+
#|typed_phone_numbers                                                           |
#+------------------------------------------------------------------------------+
#|[-5594162570~222222-PHONE~FAX-17-TEST, -2812597115~1111111-PHONE]             |
#|[-5594162570~222222-PHONE~FAX-17-TEST, -2812597115~1111111-PHONE~FAX-17-TESTB]|
#|[-2812597115~1111111-PHONE~FAX-17-TESTB, -2812597115~1111111-FAX]             |
#|[-2812597115~1111111-PHONE]                                                   |
#+------------------------------------------------------------------------------+

from pyspark.sql import functions as F
df\
  .withColumn("yo", F.expr("""(transform(typed_phone_numbers,x-> split(substring(x,2,length(x)),'\-|~')))"""))\
  .withColumn("typed_phone_numbers",\
                          F.flatten(F.expr("""transform(yo,y->\
                          IF((array_contains(y,'PHONE')==True) and (array_contains(y,'FAX')==True),\
                                                   array(concat('-',concat_ws('-',y[0],y[2],y[4],y[5])),\
                                                         concat('-',concat_ws('-',y[1],y[3],y[4],y[5]))),\
                                                         array(concat('-',concat_ws('-',y)))))""")))\
                          .drop("yo").show(truncate=False)


#+---------------------------------------------------------------------------------------------------+
#|typed_phone_numbers                                                                                |
#+---------------------------------------------------------------------------------------------------+
#|[-5594162570-PHONE-17-TEST, -222222-FAX-17-TEST, -2812597115-1111111-PHONE]                        |
#|[-5594162570-PHONE-17-TEST, -222222-FAX-17-TEST, -2812597115-PHONE-17-TESTB, -1111111-FAX-17-TESTB]|
#|[-2812597115-PHONE-17-TESTB, -1111111-FAX-17-TESTB, -2812597115-1111111-FAX]                       |
#|[-2812597115-1111111-PHONE]                                                                        |
#+---------------------------------------------------------------------------------------------------+

1 Comment

What if there are multiple elements in the array? This code only looks at the first element; for example, with ['-5594162570~222222-PHONE~FAX-17-TEST','-2812597115~1111111-PHONE~FAX-17-TESTB'] it would only process the 0th element.

You can use a combination of regexp_replace(str, string-pattern, replacement-pattern) and split(col, string-pattern) like so

Your data

data = [
    (['-5594162570~222222-PHONE~FAX-17-TEST'],),
    (['-2812597115~1111111-PHONE~FAX-17-TESTB'],),
    (['-5594162570-PHONE-17-TEST'],),
    (['-2812597115-FAX-17-TESTB'],)
]
df = spark.createDataFrame(data, ['typed_phone_numbers'])

Solution

from pyspark.sql.functions import col, regexp_replace, split
(
    df.
        withColumn(
            'typed_phone_numbers',
            split(
                regexp_replace(
                    regexp_replace(
                        col('typed_phone_numbers')[0],
                        '^(-\\d+)(~\\d+)(-PHONE)(~FAX)(-\\d+-\\w+)$',
                        '$1$3$5,$2$4$5'
                    ),
                    '~',
                    '-'
                ),
                ','
            )
        ).
        show(truncate=False)
)
+---------------------------------------------------+                           
|typed_phone_numbers                                |
+---------------------------------------------------+
|[-5594162570-PHONE-17-TEST, -222222-FAX-17-TEST]   |
|[-2812597115-PHONE-17-TESTB, -1111111-FAX-17-TESTB]|
|[-5594162570-PHONE-17-TEST]                        |
|[-2812597115-FAX-17-TESTB]                         |
+---------------------------------------------------+

1 Comment

What if there are multiple elements in the array? This code only looks at the first element; for example, with ['-5594162570~222222-PHONE~FAX-17-TEST','-2812597115~1111111-PHONE~FAX-17-TESTB'] it would only process the 0th element.
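
As the comment notes, the expression above only rewrites element 0 of the array. If the array can hold several such strings, one possible workaround (a sketch, not part of the original answer) is to run the same two regexp_replace calls inside the SQL higher-order function transform, split each rewritten element on the comma, and flatten the result (Spark 2.4+), applied here to the same df built above:

from pyspark.sql import functions as F

# Apply the regex rewrite to every array element, split each result on ','
# and flatten back into a single array. Elements that do not match the
# PHONE+FAX pattern pass through with only '~' replaced by '-'.
df2 = df.withColumn(
    'typed_phone_numbers',
    F.flatten(F.expr(r"""
        transform(typed_phone_numbers, x ->
            split(regexp_replace(regexp_replace(x,
                      '^(-\\d+)(~\\d+)(-PHONE)(~FAX)(-\\d+-\\w+)$',
                      '$1$3$5,$2$4$5'),
                  '~', '-'),
            ','))
    """))
)
df2.show(truncate=False)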
