
I have a CSV file with the following schema:

(Id, OwnerUserId, CreationDate, ClosedDate, Score, Title, Body)

I would like to split the data using one of the following:

val splitComma = file.map(x => x.split(","))
val splitComma = file.map(x => x.split(",(?![^<>]*</>)(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))

Neither of them worked. Below is a sample of my CSV file:

90,58,2008-08-01T14:41:24Z,2012-12-26T03:45:49Z,144,Good branching and merging tutorials for TortoiseSVN?,"<p>Are there any really good tutorials explaining <a href=""http://svnbook.red-bean.com/en/1.8/svn.branchmerge.html"" rel=""nofollow"">branching and merging</a> with Apache Subversion? </p>

<p>All the better if it's specific to TortoiseSVN client.</p>
"
120,83,2008-08-01T15:50:08Z,NA,21,ASP.NET Site Maps,"<p>Has anyone got experience creating <strong>SQL-based ASP.NET</strong> site-map providers?</p>

<p>I've got the default XML file <code>web.sitemap</code> working properly with my Menu and <strong>SiteMapPath</strong> controls, but I'll need a way for the users of my site to create and modify pages dynamically.</p>

<p>I need to tie page viewing permissions into the standard <code>ASP.NET</code> membership system as well.</p>
"
180,2089740,2008-08-01T18:42:19Z,NA,53,Function for creating color wheels,"<p>This is something I've pseudo-solved many times and never quite found a solution. That's stuck with me. The problem is to come up with a way to generate <code>N</code> colors, that are as distinguishable as possible where <code>N</code> is a parameter.</p>
"

What's the best way to work with this?

  • Use the spark-csv library; split(",") doesn't always work for all data. (May 2, 2017)
  • How many "lines" do you have in the CSV file (disregarding the fact that one "line" can be split across multiple lines because the last field contains HTML-like content with embedded newlines)? (May 4, 2017)

1 Answer


You can't load CSVs with multi-line values (i.e. newlines within quoted cells) using Spark: the underlying Hadoop InputFormat splits the file on newlines, disregarding the CSV's encapsulating double quotes, so there isn't much Spark can do about it (see the discussion here).
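For what it's worth, the quoted-comma lookahead in the question's second attempt does handle commas inside quoted fields correctly, as long as the whole record sits on one physical line; it's the embedded newlines that break things. A quick plain-Scala check (no Spark needed; `SplitCheck` and `splitRecord` are my own illustrative names):

```scala
object SplitCheck {
  // Split on commas that sit outside double-quoted fields. This only
  // works while an entire record fits on a single physical line.
  private val quotedCommaSplit = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"

  def splitRecord(line: String): Array[String] =
    line.split(quotedCommaSplit, -1) // -1 keeps trailing empty fields

  def main(args: Array[String]): Unit = {
    val line = "120,83,2008-08-01T15:50:08Z,NA,21,ASP.NET Site Maps,\"<p>a, b</p>\""
    // The comma inside the quoted body is not treated as a separator.
    println(splitRecord(line).mkString("|"))
  }
}
```

Once a quoted field contains a newline, however, Spark hands each physical line to the map function separately, so no per-line regex can recover the record.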

Unfortunately, that means you'll have to find some way of "cleaning" your data (e.g. replacing the embedded newlines with a placeholder) before writing it to disk or loading it with Spark.
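One way to do that cleanup outside Spark, sketched in plain Scala (the object name and the placeholder are my own choices, not a fixed convention): walk the text character by character, track whether you are inside a double-quoted field, and replace only the newlines that occur inside quotes:

```scala
object CsvNewlineCleaner {
  // Replace newlines that fall inside double-quoted CSV fields with a
  // placeholder, so each logical record ends up on one physical line.
  // Doubled quotes ("") inside a field toggle the state twice, so they
  // are handled correctly.
  def flattenQuotedNewlines(csv: String, placeholder: String = "\\n"): String = {
    val sb = new StringBuilder
    var inQuotes = false
    for (c <- csv) c match {
      case '"'              => inQuotes = !inQuotes; sb += c
      case '\n' if inQuotes => sb ++= placeholder
      case '\r' if inQuotes => () // drop stray CRs inside fields
      case other            => sb += other
    }
    sb.toString
  }

  def main(args: Array[String]): Unit = {
    val raw = "1,\"line one\nline two\"\n2,plain"
    println(flattenQuotedNewlines(raw))
    // prints:
    // 1,"line one\nline two"
    // 2,plain
  }
}
```

After this pass every record is a single line, so the quote-aware comma split (or a proper CSV parser) works per line in Spark.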
