
I am reading a CSV file which looks like this:

"ZEN","123"
"TEN","567"

Now when I try to replace the character E with regexp_replace, it's not giving correct results:

from pyspark.sql.types import StructType, StringType
from pyspark.sql.functions import (
    row_number, col, desc, date_format, to_date, to_timestamp, regexp_replace
)

inputDirPath="/FileStore/tables/test.csv"

schema = StructType()
for field in fields:
    colType = StringType()
    schema.add(field.strip(),colType,True)

incr_df = (
    spark.read
    .schema(schema)
    .option("header", "false")
    .option("delimiter", "\u002c")
    .option("nullValue", "")
    .option("emptyValue", "")
    .option("multiLine", True)
    .csv(inputDirPath)
)

for column in incr_df.columns:
    inc_new = incr_df.withColumn(column, regexp_replace(column, "E", ""))

inc_new.show()

This is not giving correct results; it does nothing.

Note: I have 100+ columns, so I need to use a loop.

Can someone help me spot my error?

  • The problem is the loop: in each iteration you take the data again from incr_df, overwriting the inc_new from the previous iteration. Commented Oct 11, 2022 at 16:56
  • So what's the way to prevent overwriting? Can you please help? Commented Oct 11, 2022 at 16:58
  • Before the loop: inc_new=incr_df. And then inside the loop: inc_new=inc_new.withColumn... Commented Oct 11, 2022 at 17:00
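The pitfall described in the comments is plain Python, not Spark-specific: each iteration derives a fresh result from the original object, so only the last column's change survives. A minimal sketch with a hypothetical dict standing in for the DataFrame:

```python
# Hypothetical illustration of the loop pitfall (a dict stands in for the
# DataFrame): each iteration restarts from the ORIGINAL data, so only the
# last key's replacement survives in `result`.
base = {"c1": "ZEN", "c2": "TEN"}

for key in base:
    result = {**base, key: base[key].replace("E", "")}  # restarts from base

# After the loop, result == {"c1": "ZEN", "c2": "TN"} — "c1" reverted.

# The fix suggested in the comments: seed an accumulator before the loop
# and keep updating the accumulator itself inside it.
fixed = dict(base)
for key in fixed:
    fixed[key] = fixed[key].replace("E", "")
# fixed == {"c1": "ZN", "c2": "TN"}
```

The same shape applies to the Spark code: assign `inc_new = incr_df` once before the loop, then call `inc_new = inc_new.withColumn(...)` inside it.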

1 Answer


A list comprehension will be neater and easier. Let's try:

inc_new = incr_df.select(*[regexp_replace(x, 'E', '').alias(x) for x in incr_df.columns])

inc_new.show()