
I have a CSV file with the following schema:

(Id, OwnerUserId, CreationDate, ClosedDate, Score, Title, Body)

I would like to split the data using one of the following:

val splitComma = file.map(x => x.split(","))
val splitComma = file.map(x => x.split(",(?![^<>]*</>)(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))

Neither of them worked. Below is a sample of my CSV file:

90,58,2008-08-01T14:41:24Z,2012-12-26T03:45:49Z,144,Good branching and merging tutorials for TortoiseSVN?,"<p>Are there any really good tutorials explaining <a href=""http://svnbook.red-bean.com/en/1.8/svn.branchmerge.html"" rel=""nofollow"">branching and merging</a> with Apache Subversion? </p>

<p>All the better if it's specific to TortoiseSVN client.</p>
"
120,83,2008-08-01T15:50:08Z,NA,21,ASP.NET Site Maps,"<p>Has anyone got experience creating <strong>SQL-based ASP.NET</strong> site-map providers?</p>

<p>I've got the default XML file <code>web.sitemap</code> working properly with my Menu and <strong>SiteMapPath</strong> controls, but I'll need a way for the users of my site to create and modify pages dynamically.</p>

<p>I need to tie page viewing permissions into the standard <code>ASP.NET</code> membership system as well.</p>
"
180,2089740,2008-08-01T18:42:19Z,NA,53,Function for creating color wheels,"<p>This is something I've pseudo-solved many times and never quite found a solution. That's stuck with me. The problem is to come up with a way to generate <code>N</code> colors, that are as distinguishable as possible where <code>N</code> is a parameter.</p>
"

What's the best way to work with this?

  • Use the spark-csv library; split(",") doesn't always work for all data. (May 2, 2017)
  • How many "lines" do you have in the CSV file (disregarding the fact that one "line" can be split across multiple lines because the last field contains HTML-like content with embedded newlines)? (May 4, 2017)

1 Answer


You can't load CSVs with multi-line values (i.e. newlines within quoted cells) using Spark: the underlying Hadoop InputFormat splits the file on newlines, disregarding the CSV's encapsulating double quotes, so there isn't much Spark can do about it (see the discussion here).
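For what it's worth, the quoted-comma lookahead in the question's second attempt does handle commas inside quoted fields correctly, as long as the whole record sits on one physical line; it's the embedded newlines that break things. A quick plain-Scala check (no Spark needed; `SplitCheck` and `splitRecord` are my own illustrative names):

```scala
object SplitCheck {
  // Split on commas that sit outside double-quoted fields. This only
  // works while an entire record fits on a single physical line.
  private val quotedCommaSplit = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"

  def splitRecord(line: String): Array[String] =
    line.split(quotedCommaSplit, -1) // -1 keeps trailing empty fields

  def main(args: Array[String]): Unit = {
    val line = "120,83,2008-08-01T15:50:08Z,NA,21,ASP.NET Site Maps,\"<p>a, b</p>\""
    // The comma inside the quoted body is not treated as a separator.
    println(splitRecord(line).mkString("|"))
  }
}
```

Once a quoted field contains a newline, however, Spark hands each physical line to the map function separately, so no per-line regex can recover the record.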

Unfortunately, that means you'll have to find some way of "cleaning" your data (e.g. replacing the embedded newlines with a placeholder) before writing it to disk or loading it with Spark.
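One way to do that cleanup outside Spark, sketched in plain Scala (the object name and the placeholder are my own choices, not a fixed convention): walk the text character by character, track whether you are inside a double-quoted field, and replace only the newlines that occur inside quotes:

```scala
object CsvNewlineCleaner {
  // Replace newlines that fall inside double-quoted CSV fields with a
  // placeholder, so each logical record ends up on one physical line.
  // Doubled quotes ("") inside a field toggle the state twice, so they
  // are handled correctly.
  def flattenQuotedNewlines(csv: String, placeholder: String = "\\n"): String = {
    val sb = new StringBuilder
    var inQuotes = false
    for (c <- csv) c match {
      case '"'              => inQuotes = !inQuotes; sb += c
      case '\n' if inQuotes => sb ++= placeholder
      case '\r' if inQuotes => () // drop stray CRs inside fields
      case other            => sb += other
    }
    sb.toString
  }

  def main(args: Array[String]): Unit = {
    val raw = "1,\"line one\nline two\"\n2,plain"
    println(flattenQuotedNewlines(raw))
    // prints:
    // 1,"line one\nline two"
    // 2,plain
  }
}
```

After this pass every record is a single line, so the quote-aware comma split (or a proper CSV parser) works per line in Spark.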
