Remove special characters from column names in a PySpark DataFrame

Question

I'm trying to read a CSV file using pyspark-sql, and most of the column names contain special characters. I would like to remove the special characters from all column names in the PySpark DataFrame. Is there a specific function that removes special characters from all the column names at once? I appreciate your response.

notNull · Accepted Answer · 2020-08-05 22:11:40Z

Use a regular expression to strip the special characters from each column name, then pass the cleaned names to .toDF().

Example:

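import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuses the active session if one exists
df = spark.createDataFrame([('a', 'b', 'v', 'd')], ['._a', '/b', 'c ', 'd('])

# Remove "_", ".", "(" and "/" from every name; the space in 'c ' is not in the pattern, so it stays
cols = [re.sub(r"[_.(/]", "", c) for c in df.columns]
df.toDF(*cols).show()  # toDF() renames columns by position and returns a new DataFrame
#+---+---+---+---+
#|  a|  b| c |  d|
#+---+---+---+---+
#|  a|  b|  v|  d|
#+---+---+---+---+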

Using .withColumnRenamed():

# Rename one column at a time, pairing each original name with its cleaned name
for old, new in zip(df.columns, cols):
    df = df.withColumnRenamed(old, new)

df.show()
#+---+---+---+---+
#|  a|  b| c |  d|
#+---+---+---+---+
#|  a|  b|  v|  d|
#+---+---+---+---+
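
If the set of special characters is not known in advance, one option is to drop everything that is not an ASCII letter or digit instead of listing characters one by one. The sketch below is only an illustration: clean_column_names is a hypothetical helper (not part of PySpark), and the numeric suffix for colliding names and the col<position> placeholder for names that end up empty are assumptions made here to keep the result unambiguous.

```python
import re


def clean_column_names(columns, replacement=""):
    """Return the names with every non-alphanumeric character replaced.

    Hypothetical helper, not a PySpark API.
    """
    cleaned = []
    seen = set()
    for position, name in enumerate(columns):
        # [^0-9a-zA-Z] matches any character that is not an ASCII letter or digit
        new_name = re.sub(r"[^0-9a-zA-Z]", replacement, name)
        if not new_name:
            # Assumption: fall back to a positional placeholder for empty names
            new_name = f"col{position}"
        candidate = new_name
        suffix = 1
        # Assumption: append _1, _2, ... when two cleaned names collide
        while candidate in seen:
            candidate = f"{new_name}_{suffix}"
            suffix += 1
        seen.add(candidate)
        cleaned.append(candidate)
    return cleaned


print(clean_column_names(['._a', '/b', 'c ', 'd(']))  # ['a', 'b', 'c', 'd']
```

With a DataFrame, the cleaned names can be applied in one step with df = df.toDF(*clean_column_names(df.columns)). Unlike the pattern in the example above, this also removes the trailing space in 'c '.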

edited Aug 5, 2020 at 22:11

answered Aug 5, 2020 at 22:06

notNull
