
I am reading a CSV file which looks like this:

"ZEN","123"
"TEN","567"

Now when I try to replace the character E with regexp_replace, it's not giving correct results:

from pyspark.sql.types import StructType, StringType
from pyspark.sql.functions import (
    row_number, col, desc, date_format, to_date, to_timestamp, regexp_replace
)

inputDirPath="/FileStore/tables/test.csv"

schema = StructType()
for field in fields:
    colType = StringType()
    schema.add(field.strip(),colType,True)

incr_df = (
    spark.read
    .schema(schema)
    .option("header", "false")
    .option("delimiter", "\u002c")
    .option("nullValue", "")
    .option("emptyValue", "")
    .option("multiLine", True)
    .csv(inputDirPath)
)

for column in incr_df.columns:
    inc_new = incr_df.withColumn(column, regexp_replace(column, "E", ""))

inc_new.show()

This is not giving correct results; it does nothing.

Note: I have 100+ columns, so I need to use a loop.

Can someone help me spot my error?

  • The problem is the loop: in each iteration you take the data again from incr_df, overwriting the inc_new from the previous iteration. Commented Oct 11, 2022 at 16:56
  • So what's the way to prevent overwriting? Can you please help? Commented Oct 11, 2022 at 16:58
  • Before the loop: inc_new=incr_df. And then inside the loop: inc_new=inc_new.withColumn... Commented Oct 11, 2022 at 17:00
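The pitfall described in the comments is plain Python, not Spark-specific: each iteration derives a fresh result from the original object, so only the last column's change survives. A minimal sketch with a hypothetical dict standing in for the DataFrame:

```python
# Hypothetical illustration of the loop pitfall (a dict stands in for the
# DataFrame): each iteration restarts from the ORIGINAL data, so only the
# last key's replacement survives in `result`.
base = {"c1": "ZEN", "c2": "TEN"}

for key in base:
    result = {**base, key: base[key].replace("E", "")}  # restarts from base

# After the loop, result == {"c1": "ZEN", "c2": "TN"} — "c1" reverted.

# The fix suggested in the comments: seed an accumulator before the loop
# and keep updating the accumulator itself inside it.
fixed = dict(base)
for key in fixed:
    fixed[key] = fixed[key].replace("E", "")
# fixed == {"c1": "ZN", "c2": "TN"}
```

The same shape applies to the Spark code: assign `inc_new = incr_df` once before the loop, then call `inc_new = inc_new.withColumn(...)` inside it.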

1 Answer


A list comprehension will be neater and easier. Let's try:

inc_new = incr_df.select(*[regexp_replace(x, 'E', '').alias(x) for x in incr_df.columns])

inc_new.show()