0

I have a dataframe. A column "names" contains columns headers, which values should be concated. I want to do it with pyspark concat_ws() but nothing works. I must solve it with concat_ws() function, no pandas etc.

The best what I got it was concated headers, but not values in this columns. I couldn't return list from function to unpack it in concat_ws()

map_cols = {'a':'newA', 'b':'newB', 'c':'newC', 'd':'newD'}

@udf
def get_result(names_col):
    headers = []
    for i in names_col:
        headers.append(map_cols[i])
    return headers

df = df.withColumn('names_arr', split('names', '_')).withColumn('result', concat_ws(';', get_result(col('names_arr'))))
Input dataframe:

names   | newA|newB|newC|newD
---------------------------
a_b     |1    | 2  | 7  |8
---------------------------
a_b_c   |2    | 3  | 4  |4
---------------------------
a_b_c_d |3    | 2  |4   |4
---------------------------
c_d     | 89  |  5 |3   |5
---------------------------
b_c_d   |  7  |5   |6   | 5


Expected output dataframe

names   | newA|newB|newC|newD|result
--------------------------------------
a_b     |1    | 2  | 7  | 8  |1;2
--------------------------------------
a_b_c   |2    | 3  | 4  |4   |2;3;4
--------------------------------------
a_b_c_d |3    | 2  |4   |4   |2;3;4;4
--------------------------------------
c_d     |89   |  5 |3   |5   |3;5
--------------------------------------
b_c_d   |7    |5   |6   | 5  |5;6;5

1 Answer 1

1

I am assuming that in your expected output colA is a typo for the last two rows (89 and 7)

You can iterate the dataframe.columns and perform concat_ws

# Skip data prepare

#import 
import pyspark.sql.functions as f

df.show()
+-------+----+----+----+----+
|  names|newA|newB|newC|newD|
+-------+----+----+----+----+
|    a_b|   1|   2|null|null|
|  a_b_c|   2|   3|   4|null|
|a_b_c_d|   3|   2|   4|   4|
|    c_d|null|null|   3|   5|
|  b_c_d|null|   5|   6|   5|
+-------+----+----+----+----+

Filetring column name if its names and the concat by ; separator

df.withColumn('result', f.concat_ws(';', *[c for c in df.columns if c!='names'])).show()
+-------+----+----+----+----+-------+
|  names|newA|newB|newC|newD| result|
+-------+----+----+----+----+-------+
|    a_b|   1|   2|null|null|    1;2|
|  a_b_c|   2|   3|   4|null|  2;3;4|
|a_b_c_d|   3|   2|   4|   4|3;2;4;4|
|    c_d|null|null|   3|   5|    3;5|
|  b_c_d|null|   5|   6|   5|  5;6;5|
+-------+----+----+----+----+-------+
Sign up to request clarification or add additional context in comments.

4 Comments

no, it isn't a typo and also I have other columns, not only these. I need to concat columns mentioned only in column 'names'
@Zhenya : Ok, Can you explain how 87 and 7 came in expected output as its not present in Input.
oh sorry, that's my fault, 87 and 7 should be in input. I edited my question
@samkart I changed the data one more time. Be more attentive! I think now you'll understand it correctly

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.