
There is a PySpark source dataframe with a column named X. The column X consists of values delimited by '-', and there can be any number of such values in it. An example of the source dataframe is given below:

X
A123-B345-C44656-D4423-E3445-F5667
X123-Y345
Z123-N345-T44656-M4423
X123

Now I need to split this column on the delimiter and pull out exactly N=4 separate values. If there are more than 4 delimited values, we keep the first 4 and discard the rest. If there are fewer than 4, we keep the existing ones and pad the remaining slots with the empty string "".

Resulting output should be like below:

X Col1 Col2 Col3 Col4
A123-B345-C44656-D4423-E3445-F5667 A123 B345 C44656 D4423
X123-Y345 X123 Y345
Z123-N345-T44656-M4423 Z123 N345 T44656 M4423
X123 X123

I have accomplished this easily in plain Python with the code below, but I am looking for a PySpark approach:

    from itertools import chain, islice, repeat

    def pad_infinite(iterable, padding=None):
        # Yield the items of `iterable`, then the padding value forever
        return chain(iterable, repeat(padding))

    def pad(iterable, size, padding=None):
        # Truncate or pad `iterable` to exactly `size` items
        return islice(pad_infinite(iterable, padding), size)

    colA, colB, colC, colD = list(pad(X.split('-'), 4, ''))

1 Answer


You can split the string into an array, separate the elements of the array into columns and then fill the null values with an empty string:

from pyspark.sql import functions as F

df = ...
df.withColumn("arr", F.split("X", "-")) \
    .selectExpr("X", "arr[0] as Col1", "arr[1] as Col2", "arr[2] as Col3", "arr[3] as Col4") \
    .na.fill("") \
    .show(truncate=False)

Output:

+----------------------------------+----+----+------+-----+
|X                                 |Col1|Col2|Col3  |Col4 |
+----------------------------------+----+----+------+-----+
|A123-B345-C44656-D4423-E3445-F5667|A123|B345|C44656|D4423|
|X123-Y345                         |X123|Y345|      |     |
|Z123-N345-T44656-M4423            |Z123|N345|T44656|M4423|
|X123                              |X123|    |      |     |
+----------------------------------+----+----+------+-----+
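
Since the question fixes N at 4, the four arr[i] expressions are written out by hand above. If N varies, they can be generated in a loop; the following is a minimal sketch of that idea (the names N and split_col are mine, not part of the original answer), using coalesce instead of na.fill so only the new columns are touched:

    from pyspark.sql import functions as F

    N = 4  # number of delimited values to extract
    split_col = F.split("X", "-")

    # getItem(i) yields null for out-of-range indices;
    # coalesce replaces those nulls with an empty string.
    df.select(
        "X",
        *[F.coalesce(split_col.getItem(i), F.lit("")).alias(f"Col{i + 1}")
          for i in range(N)]
    ).show(truncate=False)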

3 Comments

The split is not happening as expected for me with this, and it is also truncating the rest of my dataframe's columns once it runs. None of the steps above it are affected; I tried with a sample dataframe, but the same complete df truncation is seen.
@AmitSingh What result do you see from split? The column arr should be a string array. Do you get this array? Dropping the remaining columns is expected: it happens in the selectExpr statement. You can either add all other required columns to the list of columns, or you can try selectExpr("*", "arr[0]", ...) (a sketch of this variant follows the comments).
Apologies! Split is returning the expected array. Also, selectExpr needed a * character to keep all other columns intact, as you said. Thanks a ton for your help; this is the expected PySpark answer. Marking it accepted.
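
For completeness, here is a sketch of the selectExpr("*", ...) variant discussed in the comments, which keeps the remaining dataframe columns intact. It assumes the same df and F import as the answer; the subset argument restricts the fill to the new columns, so nulls elsewhere in the df are left alone:

    df.withColumn("arr", F.split("X", "-")) \
        .selectExpr("*", "arr[0] as Col1", "arr[1] as Col2",
                    "arr[2] as Col3", "arr[3] as Col4") \
        .drop("arr") \
        .na.fill("", subset=["Col1", "Col2", "Col3", "Col4"]) \
        .show(truncate=False)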
