There is a PySpark source dataframe with a column named X. Column X consists of values delimited by '-', and there can be any number of such values in a row. An example of the source dataframe is given below:
| X |
|---|
| A123-B345-C44656-D4423-E3445-F5667 |
| X123-Y345 |
| Z123-N345-T44656-M4423 |
| X123 |
Now I need to split this column on the delimiter and pull out exactly N=4 separate values. If there are more than 4 delimited values, I need the first 4 and discard the rest. If there are fewer than 4, I need to keep the existing ones and pad the remaining columns with the empty string "".
Resulting output should be like below:
| X | Col1 | Col2 | Col3 | Col4 |
|---|---|---|---|---|
| A123-B345-C44656-D4423-E3445-F5667 | A123 | B345 | C44656 | D4423 |
| X123-Y345 | X123 | Y345 | | |
| Z123-N345-T44656-M4423 | Z123 | N345 | T44656 | M4423 |
| X123 | X123 | | | |
I have accomplished this easily in plain Python with the code below, but I'm looking for a PySpark way to do it:
    from itertools import chain, islice, repeat

    def pad_infinite(iterable, padding=None):
        # Yield the items of the iterable, then the padding value forever
        return chain(iterable, repeat(padding))

    def pad(iterable, size, padding=None):
        # Truncate or pad the iterable to exactly `size` items
        return islice(pad_infinite(iterable, padding), size)

    # X here is a plain Python string, e.g. "X123-Y345"
    colA, colB, colC, colD = list(pad(X.split('-'), 4, ''))
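For reference, below is a minimal sketch of the direction I was considering in PySpark. It assumes that indexing past the end of the array produced by `split` yields null, so `coalesce` can turn the missing positions into `""`; the `spark` session and the dataframe are just a mock-up of the real source:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Mock-up of the source dataframe from the question
df = spark.createDataFrame(
    [("A123-B345-C44656-D4423-E3445-F5667",),
     ("X123-Y345",),
     ("Z123-N345-T44656-M4423",),
     ("X123",)],
    ["X"],
)

N = 4
parts = F.split(F.col("X"), "-")  # array column of the delimited tokens

# getItem(i) yields null once i runs past the array length,
# and coalesce(..., lit("")) turns those nulls into empty strings
df = df.select(
    "X",
    *[F.coalesce(parts.getItem(i), F.lit("")).alias(f"Col{i + 1}") for i in range(N)],
)
df.show(truncate=False)
```

Is something along these lines the recommended approach, or is there a cleaner way to do this in PySpark?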