how to concat values of columns in pyspark

Question

I have a dataframe. A column "names" contains columns headers, which values should be concated. I want to do it with pyspark concat_ws() but nothing works. I must solve it with concat_ws() function, no pandas etc.

The best what I got it was concated headers, but not values in this columns. I couldn't return list from function to unpack it in concat_ws()

map_cols = {'a':'newA', 'b':'newB', 'c':'newC', 'd':'newD'}

@udf
def get_result(names_col):
    headers = []
    for i in names_col:
        headers.append(map_cols[i])
    return headers

df = df.withColumn('names_arr', split('names', '_')).withColumn('result', concat_ws(';', get_result(col('names_arr'))))

Input dataframe:

names   | newA|newB|newC|newD
---------------------------
a_b     |1    | 2  | 7  |8
---------------------------
a_b_c   |2    | 3  | 4  |4
---------------------------
a_b_c_d |3    | 2  |4   |4
---------------------------
c_d     | 89  |  5 |3   |5
---------------------------
b_c_d   |  7  |5   |6   | 5


Expected output dataframe

names   | newA|newB|newC|newD|result
--------------------------------------
a_b     |1    | 2  | 7  | 8  |1;2
--------------------------------------
a_b_c   |2    | 3  | 4  |4   |2;3;4
--------------------------------------
a_b_c_d |3    | 2  |4   |4   |2;3;4;4
--------------------------------------
c_d     |89   |  5 |3   |5   |3;5
--------------------------------------
b_c_d   |7    |5   |6   | 5  |5;6;5

SMaZ · Accepted Answer · 2019-09-03 01:36:54Z

1

I am assuming that in your expected output colA is a typo for the last two rows (89 and 7)

You can iterate the dataframe.columns and perform concat_ws

# Skip data prepare

#import 
import pyspark.sql.functions as f

df.show()
+-------+----+----+----+----+
|  names|newA|newB|newC|newD|
+-------+----+----+----+----+
|    a_b|   1|   2|null|null|
|  a_b_c|   2|   3|   4|null|
|a_b_c_d|   3|   2|   4|   4|
|    c_d|null|null|   3|   5|
|  b_c_d|null|   5|   6|   5|
+-------+----+----+----+----+

Filetring column name if its names and the concat by ; separator

df.withColumn('result', f.concat_ws(';', *[c for c in df.columns if c!='names'])).show()
+-------+----+----+----+----+-------+
|  names|newA|newB|newC|newD| result|
+-------+----+----+----+----+-------+
|    a_b|   1|   2|null|null|    1;2|
|  a_b_c|   2|   3|   4|null|  2;3;4|
|a_b_c_d|   3|   2|   4|   4|3;2;4;4|
|    c_d|null|null|   3|   5|    3;5|
|  b_c_d|null|   5|   6|   5|  5;6;5|
+-------+----+----+----+----+-------+

edited Sep 3, 2019 at 1:36

answered Sep 3, 2019 at 1:30

SMaZ

2,6551 gold badge14 silver badges28 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Zhenya Over a year ago

no, it isn't a typo and also I have other columns, not only these. I need to concat columns mentioned only in column 'names'

SMaZ Over a year ago

@Zhenya : Ok, Can you explain how 87 and 7 came in expected output as its not present in Input.

Zhenya Over a year ago

oh sorry, that's my fault, 87 and 7 should be in input. I edited my question

Zhenya Over a year ago

@samkart I changed the data one more time. Be more attentive! I think now you'll understand it correctly

Collectives™ on Stack Overflow

how to concat values of columns in pyspark

1 Answer 1

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related