I'm trying to read a CSV file using PySpark SQL, and most of the column names contain special characters. I would like to remove the special characters from all column names in the PySpark DataFrame. Is there a function that removes special characters from all column names at once? I appreciate your response.
1 Answer
Use a regular-expression replace (re.sub) to strip the special characters from each column name, then pass the cleaned names to .toDF() to rename all columns in one call.
Example:
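import re
df=spark.createDataFrame([('a','b','v','d')],['._a','/b','c ','d('])
# Remove "_", ".", "(" and "/" from every column name; the raw string avoids invalid-escape warnings.
# The trailing space in 'c ' is not in the pattern, so that name keeps its space.
cols=[re.sub(r"[_.(/]","",i) for i in df.columns]
# toDF() assigns the new names positionally
df.toDF(*cols).show()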
#+---+---+---+---+
#|  a|  b| c |  d|
#+---+---+---+---+
#|  a|  b|  v|  d|
#+---+---+---+---+
Using .withColumnRenamed():
# Rename one column at a time, pairing each old name with its cleaned name
for i,j in zip(df.columns,cols):
    df=df.withColumnRenamed(i,j)
df.show()
#+---+---+---+---+
#|  a|  b| c |  d|
#+---+---+---+---+
#|  a|  b|  v|  d|
#+---+---+---+---+
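The pattern above only removes the four characters it lists, so the trailing space in 'c ' survives. If you don't know in advance which special characters will show up, one option is to drop everything that is not a letter, digit, or underscore. Below is a minimal sketch of that idea in plain Python. The helper name clean_column_names, the "col" fallback for names that end up empty, and the numeric suffix for duplicates are my own assumptions for illustration, not part of any PySpark API.

```python
import re

def clean_column_names(names):
    """Return names with every character outside [0-9a-zA-Z_] removed.

    Hypothetical helper: the "col" fallback and the "_<n>" suffix for
    duplicate results are illustrative choices, not Spark behaviour.
    """
    used = set()
    cleaned = []
    for name in names:
        # Keep only letters, digits, and underscores
        base = re.sub(r"[^0-9a-zA-Z_]", "", name)
        if not base:
            base = "col"  # the name consisted only of special characters
        # Two different originals can collapse to the same cleaned name,
        # so append a counter until the result is unique
        candidate = base
        n = 1
        while candidate in used:
            candidate = f"{base}_{n}"
            n += 1
        used.add(candidate)
        cleaned.append(candidate)
    return cleaned

# With a DataFrame this would be used as:
#   df = df.toDF(*clean_column_names(df.columns))
```

Note that this version keeps underscores (so '._a' becomes '_a'), since they are generally safe in column names. Add "_" handling to the pattern if you want them removed as well.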