In PySpark, I am trying to clean a dataset. Some of the columns contain unwanted characters (=" ") in their values. I read the dataset as a DataFrame, and I have already written a user-defined function that removes the characters successfully, but now I am struggling to write a script that identifies which columns I need to apply the UDF to. I only inspect the last row of the dataset, assuming the columns always contain similar entries.
DataFrame (df):
id value1 value2 value3
="100010" 10 20 ="30"
In plain Python (pandas), the following works:
columns_to_fix = []
for col in df:
    value = df[col][0]
    if type(value) == str and value.startswith('='):
        columns_to_fix.append(col)
I tried the following in PySpark, but this returns all the column names:
columns_to_fix = []
for x in df.columns:
    if df[x].like('%="'):
        columns_to_fix.append(x)
Desired output:
columns_to_fix: ['id', 'value3']
Once I have the column names in a list, I can use a for loop to fix the entries in those columns. I am very new to PySpark, so my apologies if this is too basic a question. Thank you so much in advance for your advice!
df[x].like('%="') returns a Column expression, which is never None, so the test is always True. Spark builds the expression lazily; it does not evaluate it against your data inside the if. You need to collect() (or first()) to bring actual row values to the driver and check their content.
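A minimal sketch of that idea: pull one row to the driver as a plain Python dict and test its values there. The helper name columns_needing_fix is my own; it assumes you have already collected a row, e.g. with df.first().asDict() in PySpark.

```python
def columns_needing_fix(row_dict):
    """Return the names of columns whose value is a string starting with '='.

    row_dict: a plain dict of column name -> value, e.g. the result of
    df.first().asDict() (or df.tail(1)[0].asDict() for the last row).
    """
    return [
        col for col, value in row_dict.items()
        if isinstance(value, str) and value.startswith('=')
    ]

# Example with the sample row from the question:
row = {'id': '="100010"', 'value1': 10, 'value2': 20, 'value3': '="30"'}
print(columns_needing_fix(row))  # → ['id', 'value3']
```

In PySpark this becomes columns_to_fix = columns_needing_fix(df.first().asDict()), after which you can loop over columns_to_fix and apply your UDF with withColumn.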