Parse CSV file in Scala

Question

I am trying to load a CSV file that contains Japanese characters into a DataFrame in Scala. When I read a column value such as "セキュリティ対策ウェビナー開催中】受講登録でスグに役立つ「e-Book」を進呈!", which is supposed to go into a single column, the string is broken at "」" (it is treated as a new line) and two records are created. I have set the "charset" option to UTF-16 and the quote character is "\"", but the DataFrame still shows more records than the file contains.

val df = spark.read.option("sep", "\t").option("header", "true").option("charset","UTF-16").option("inferSchema", "true").csv("file.txt")

Any pointer on how to solve this would be very helpful.

I was able to read the same file without any issues in U-SQL with Unicode encoding, which is UTF-16 only. u-sql link
– sky, Commented Mar 9, 2019 at 19:20

curious_me · Accepted Answer · 2019-03-10 19:37:05Z

1

Looks like there's a new line character in your Japanese string. Can you try using the multiLine option while reading the file?

val data = spark.read.format("csv")
 .option("header", "true")
 .option("delimiter", "\t")    // the file in the question is tab-separated
 .option("charset", "utf-16")
 .option("inferSchema", "true")
 .option("multiLine", "true")  // allow a single record to span several lines
 .load(filePath)

Note: According to the answers to "How to handle multi line rows in spark?", there are some concerns with this approach when the input file is very big.
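
A possible explanation for the break at "」", offered as an assumption rather than something confirmed in this thread: "」" is U+300D, and in UTF-16 one of its two bytes is 0x0D, the same value as a carriage return. If lines are split on raw bytes before the text is decoded, that byte would look like a line ending, which would also be consistent with multiLine making a difference. The sketch below only inspects byte values; the object and method names are made up for illustration.

```scala
import java.nio.charset.{Charset, StandardCharsets}

object Utf16ByteCheck {
  // Hex representation of each byte of `s` when encoded with `cs`.
  def hexBytes(s: String, cs: Charset): Seq[String] =
    s.getBytes(cs).toSeq.map(b => f"${b & 0xff}%02X")

  // True if the encoded form contains a raw LF (0x0A) or CR (0x0D) byte.
  def containsLineBreakByte(s: String, cs: Charset): Boolean =
    s.getBytes(cs).exists(b => b == 0x0A || b == 0x0D)

  def main(args: Array[String]): Unit = {
    val closingBracket = "\u300D" // the character "」"
    println(hexBytes(closingBracket, StandardCharsets.UTF_16BE).mkString(" ")) // prints "30 0D"
    println(containsLineBreakByte(closingBracket, StandardCharsets.UTF_16LE))  // prints "true"
    println(containsLineBreakByte(closingBracket, StandardCharsets.UTF_8))     // prints "false"
  }
}
```

The opening bracket "「" (U+300C) encodes as 30 0C, which contains no line-break byte, so only the closing bracket would be affected under this assumption.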

edited Mar 10, 2019 at 19:37

answered Mar 10, 2019 at 19:29

curious_me


KZapagol · Accepted Answer · 2019-03-10 09:05:46Z

1

The code below should work for UTF-16. I wasn't able to set the CSV file's encoding to UTF-16 in Notepad++, so I tested it with UTF-8. Please make sure that your input file is actually encoded as UTF-16.

Code snippet :

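import java.io.{BufferedReader, FileInputStream, InputStreamReader}

val br = new BufferedReader(
    new InputStreamReader(
      new FileInputStream("C:/Users/../Desktop/csvFile.csv"), "UTF-16"))

  // readLine() returns null at end of file, so keep reading until then
  try {
    Iterator.continually(br.readLine()).takeWhile(_ != null).foreach(println)
  } finally {
    br.close()
  }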

csvFile content used:

【セキュリティ対策ウェビナー開催中】受講登録でスグに役立つ「e-Book」を進呈!,January, セキュリティ, 開催, 1000.00
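
If Notepad++ cannot save the file as UTF-16, one way to test the reader is to write the sample row to a temporary UTF-16 file from Scala and read it back. This is a minimal sketch and not part of the tested code above: the object name, helper names, and temp-file prefix are hypothetical, and it assumes the source file itself is compiled as UTF-8.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.io.{Codec, Source}

object Utf16RoundTrip {
  // Sample row taken from the csvFile content shown above.
  val sampleRow: String =
    "【セキュリティ対策ウェビナー開催中】受講登録でスグに役立つ「e-Book」を進呈!,January, セキュリティ, 開催, 1000.00"

  // Write the given lines to a new temporary file encoded as UTF-16 (with BOM).
  def writeUtf16(lines: Seq[String]): Path = {
    val tmp = Files.createTempFile("utf16-sample", ".csv")
    Files.write(tmp, lines.map(_ + "\n").mkString.getBytes(StandardCharsets.UTF_16))
    tmp
  }

  // Decode the file as UTF-16 first, then split it into lines.
  def readLines(path: Path): List[String] = {
    val src = Source.fromFile(path.toFile)(Codec(StandardCharsets.UTF_16))
    try src.getLines().toList finally src.close()
  }

  def main(args: Array[String]): Unit = {
    val tmp = writeUtf16(Seq(sampleRow))
    try {
      val lines = readLines(tmp)
      println(lines.size)                   // number of records read back
      println(lines.head.split(",").length) // number of comma-separated fields
    } finally Files.delete(tmp)
  }
}
```

Because the text is decoded before it is split into lines, the row containing "」" should come back as a single line with five fields.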

Update:

If you want to load the file using Spark, you can load the CSV file as below.

spark.read
      .format("com.databricks.spark.csv")
      .option("charset", "UTF-16")
      .option("header", "false")
      .option("escape", "\\")
      .option("delimiter", ",")
      .option("inferSchema", "false")
      .load(fromPath)

Sample Input file for above code:

  "102","03","セキュリティ対策ウェビナー開催中】受講登録でスグに役立つ「e-Book」を進呈!","ｶｸﾞﾗｱｶｶﾞﾜﾔﾂｷﾖｸ","セキュリティ","受講登録でス"

edited Mar 10, 2019 at 9:05

answered Mar 9, 2019 at 20:01

KZapagol


3 Comments

sky Over a year ago

I am getting a count mismatch while loading the data into a DataFrame. I have edited the question with the code. Do I need to add any other filter?

KZapagol Over a year ago

Please refer to my updated answer. Hope it helps!

KZapagol Over a year ago

If the updated code doesn't work, please share your sample CSV file.
