I have a CSV file with UTF-16 LE encoding.
I am able to parse the data using the code below (Spark 2.4.5):
df = spark.read \
.schema('`my_id` string') \
.option('sep', '\t') \
.option('header', 'true') \
.option('encoding', 'UTF-16') \
.csv(my_path)
The source data looks like this:
my_id
123
456
When using df.show(), or when writing the data to Parquet with df.repartition(1).write.mode('append').format('parquet').save(my_target_path), I get the output below:
my_id
�
123�
456�
Opening the raw file in Notepad++, I can see the encoding is reported as UCS-2 LE BOM.
When I open the file in VS Code, it is detected as UTF-16 LE.
Question: Is it possible to use the native spark.read.csv() in a way that avoids the additional character that gets added at the end of each line?
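My suspicion (an assumption on my part, not confirmed against the Spark source) is that the line splitting happens on raw bytes before decoding: in UTF-16-LE, '\n' is the two bytes 0A 00, so splitting on the single byte 0A leaves a stray 00 byte attached to each record. A minimal pure-Python sketch of that behavior:

```python
# Pure-Python sketch of the suspected cause (not Spark itself):
# simulate splitting UTF-16-LE bytes on the single byte 0x0A, the way a
# byte-oriented line reader would, before any decoding takes place.
data = 'my_id\n123\n456\n'.encode('utf-16-le')

raw_lines = data.split(b'\n')  # byte-level split on 0x0A only

# Every line after the first starts with a leftover 0x00 byte from the
# two-byte newline (0A 00); decoded naively, it shows up as a stray char.
for line in raw_lines:
    print(line)
```

This reproduces exactly the pattern in the output above: a clean first token, then each following value carrying one extra byte.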


What does the Spark DataFrame actually contain? Did you use .show() to print it out? .show() might have used an incorrect encoding.
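A sanity check you can run outside Spark: if the whole byte stream is decoded as UTF-16-LE first and only then split into lines, no stray character appears. That suggests the data itself is fine and the problem is in how the lines are split, not in how .show() renders them. A minimal sketch:

```python
# Sketch: decode first, split second. If this yields clean values, the
# bytes on disk are fine and the byte-level line splitting is at fault.
data = 'my_id\n123\n456\n'.encode('utf-16-le')

# Decoding the full stream before splitting keeps the lines clean:
lines = data.decode('utf-16-le').splitlines()
print(lines)  # ['my_id', '123', '456']
```

If that holds for your file, one hedged workaround (untested here) would be to read and decode the bytes yourself, e.g. via an RDD of decoded lines, and hand those strings to the CSV parser instead of pointing it at the raw file.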