
There is a PySpark source dataframe with a column named X. The column X consists of values delimited by '-', and there can be any number of such values in it. An example of the source dataframe is given below:

X
A123-B345-C44656-D4423-E3445-F5667
X123-Y345
Z123-N345-T44656-M4423
X123

Now I need to split this column on the delimiter and pull out exactly N=4 separate values. If there are more than 4 delimited values, we keep the first 4 and discard the rest. If there are fewer than 4, we keep the existing ones and pad the remaining slots with the empty string "".

Resulting output should be like below:

X Col1 Col2 Col3 Col4
A123-B345-C44656-D4423-E3445-F5667 A123 B345 C44656 D4423
X123-Y345 X123 Y345
Z123-N345-T44656-M4423 Z123 N345 T44656 M4423
X123 X123

I have accomplished this easily in plain Python with the code below, but I am looking for a PySpark approach:

    from itertools import chain, islice, repeat

    def pad_infinite(iterable, padding=None):
        # Yield the items of `iterable`, then the padding value forever
        return chain(iterable, repeat(padding))

    def pad(iterable, size, padding=None):
        # Truncate or pad `iterable` to exactly `size` items
        return islice(pad_infinite(iterable, padding), size)

    colA, colB, colC, colD = list(pad(X.split('-'), 4, ''))

1 Answer


You can split the string into an array, separate the elements of the array into columns and then fill the null values with an empty string:

from pyspark.sql import functions as F

df = ...
df.withColumn("arr", F.split("X", "-")) \
    .selectExpr("X", "arr[0] as Col1", "arr[1] as Col2", "arr[2] as Col3", "arr[3] as Col4") \
    .na.fill("") \
    .show(truncate=False)

Output:

+----------------------------------+----+----+------+-----+
|X                                 |Col1|Col2|Col3  |Col4 |
+----------------------------------+----+----+------+-----+
|A123-B345-C44656-D4423-E3445-F5667|A123|B345|C44656|D4423|
|X123-Y345                         |X123|Y345|      |     |
|Z123-N345-T44656-M4423            |Z123|N345|T44656|M4423|
|X123                              |X123|    |      |     |
+----------------------------------+----+----+------+-----+
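
Since the question fixes N at 4, the four arr[i] expressions are written out by hand above. If N varies, they can be generated in a loop; the following is a minimal sketch of that idea (the names N and split_col are mine, not part of the original answer), using coalesce instead of na.fill so only the new columns are touched:

    from pyspark.sql import functions as F

    N = 4  # number of delimited values to extract
    split_col = F.split("X", "-")

    # getItem(i) yields null for out-of-range indices;
    # coalesce replaces those nulls with an empty string.
    df.select(
        "X",
        *[F.coalesce(split_col.getItem(i), F.lit("")).alias(f"Col{i + 1}")
          for i in range(N)]
    ).show(truncate=False)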

3 Comments

The split is not happening as expected for me with this, and it is also truncating the rest of my dataframe's columns once it runs. None of the steps above it are affected; I tried with a sample dataframe, but the same complete df truncation is seen.
@AmitSingh What result do you see from split? The column arr should be a string array. Do you get this array? Dropping the remaining columns is expected: it happens in the selectExpr statement. You can either add all other required columns to the list of columns, or you can try selectExpr("*", "arr[0]", ...) (a sketch of this variant follows the comments).
Apologies! Split is returning the expected array. Also, selectExpr needed a * character to keep all other columns intact, as you said. Thanks a ton for your help; this is the expected PySpark answer. Marking it accepted.
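
For completeness, here is a sketch of the selectExpr("*", ...) variant discussed in the comments, which keeps the remaining dataframe columns intact. It assumes the same df and F import as the answer; the subset argument restricts the fill to the new columns, so nulls elsewhere in the df are left alone:

    df.withColumn("arr", F.split("X", "-")) \
        .selectExpr("*", "arr[0] as Col1", "arr[1] as Col2",
                    "arr[2] as Col3", "arr[3] as Col4") \
        .drop("arr") \
        .na.fill("", subset=["Col1", "Col2", "Col3", "Col4"]) \
        .show(truncate=False)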
