Databricks package com.databricks.spark.xml with encoding issues

Question

I'm dealing with an encoding problem that is almost resolved by applying decode/encode to the affected column of a DataFrame, as in the following example:

from pyspark.sql.functions import decode, encode

df = df.withColumn("column1", decode(encode("column1", "windows-1252"), "UTF8"))

This converts most values from the garbled form to the expected one.

However, in some special cases, such as "Á" or "Í", I can't get the expected result.


Has anyone dealt with the same problem and had good results with another solution?

Thanks in advance!

ptfaferreira · Accepted Answer · 2020-11-17 15:24:24Z


I resolved this problem by changing the encoding to iso-8859-15, and by also changing how the data is loaded so that it uses this encoding, as in the example below:

df = (
    spark.read.format("com.databricks.spark.xml")
    .option("encoding", "UTF-8")
    .option("charset", "iso-8859-15")
    .option("rowTag", "Header")
    .load(folder_path)
)
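
A possible explanation for why "Á" and "Í" in particular fail with windows-1252, offered as an assumption rather than something confirmed in this thread: their UTF-8 encodings contain the bytes 0x81 and 0x8D, which windows-1252 leaves undefined, while iso-8859-15 assigns a character to every byte value and so round-trips cleanly. The plain-Python sketch below only illustrates the character mappings; the helper names `undefined_in` and `repair` are hypothetical, and the JVM charsets that Spark uses may handle undefined bytes differently (for example by substituting a replacement character instead of raising an error).

```python
def undefined_in(codec):
    """Return the byte values (0-255) that `codec` cannot decode."""
    missing = []
    for b in range(256):
        try:
            bytes([b]).decode(codec)
        except UnicodeDecodeError:
            missing.append(b)
    return missing


def repair(text, wrong_codec):
    """Undo mojibake: re-encode with the wrongly applied codec, then decode as UTF-8."""
    return text.encode(wrong_codec).decode("utf-8")


# Python's cp1252 codec has five unassigned bytes; iso-8859-15 has none.
cp1252_gaps = undefined_in("cp1252")
latin9_gaps = undefined_in("iso-8859-15")
print([hex(b) for b in cp1252_gaps], latin9_gaps)

for ch in "ÁÍ":
    raw = ch.encode("utf-8")               # b'\xc3\x81' and b'\xc3\x8d'
    # The second byte falls into one of the windows-1252 gaps.
    assert raw[1] in cp1252_gaps
    garbled = raw.decode("iso-8859-15")    # simulate the mis-decoded value
    assert repair(garbled, "iso-8859-15") == ch
```

Under this assumption, any codec that maps all 256 byte values one-to-one (iso-8859-1 as well as iso-8859-15) would allow the same lossless round trip.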
