1

PROBLEM: Ruby 2.0's CSV reader on Mac OS X Mavericks handles Microsoft Excel-generated CSV files with embedded HTML differently from Ruby 1.8 with FasterCSV, where the same files worked fine.

I just upgraded my Mac to Mavericks (OS X 10.9.4) and also upgraded Ruby to 2.0.0p451. (I used to use Ruby 1.8+ with the FasterCSV gem but now use Ruby 2.0+ with its native CSV.)

Ruby Version:

ruby -v
ruby 2.0.0p451 (2014-02-24 revision 45167) [universal.x86_64-darwin13]

The CSV file is generated from Office 2011, saved from an original ".xlsx" file.

The following HTML is contained in a single cell of the Microsoft .xlsx file BEFORE it is saved as CSV...

<h1 style="text-align:center; font: bold 1.5em Arial;">This is the Title</h1>
<p style="text-align:center;"><img style="width:300px; height:100px" src="./IMAGES/MAIN/image1.png" alt="Image 1"/></p>
<p style="text-align:center;">This is a sentence.</p>

Other cells also contain embedded HTML.

To reproduce...

  1. Open an Excel worksheet.
  2. Copy the above HTML into cell A1. Ensure that there is a Mac carriage return (control+command+return) between HTML constructs (e.g. between the end of the "h1" construct and the start of the next "p" construct), so that every complete HTML construct sits on its own line within the Excel cell.
  3. Copy what is in cell A1 to cell A2, directly below cell A1, to ensure multiple CSV rows (your file will have two formal CSV rows).
  4. First save the file as an xlsx file (e.g. "file.xlsx").
  5. Then save the worksheet as a CSV file (e.g. "file.csv").

You will now have an Excel generated CSV file that has two formal CSV rows, where each row will have multiple HTML constructs that are separated by line feeds, within it.

Reading the CSV File...

I use the following code to read the CSV file and print the contents of each cell, both before and after I try to strip control characters...

require 'csv'

arrayOfHtmlConstructs = CSV.read("file.csv")
arrayOfHtmlConstructs.each_with_index do |construct, i|
  output = "" << construct.to_s
  puts "BEFORE: " << output
  output = output.gsub(/\r/, "") # Replace Microsoft carriage returns FAILS!
  output = output.gsub(/\\"/, "\"") # Replace escaped quotes with quotes WORKS FINE!
  output = output.gsub(/\[\"/, "") # Remove prefix [" WORKS FINE!
  output = output.gsub(/\"\]/, "") # Remove suffix "]  WORKS FINE!
  puts "AFTER: " << output
end

Before trying to strip code, the CSV string "output" looks as follows...

BEFORE: ["<h1 style=\"text-align:center; font: bold 1.5em Arial;\">This is the Title</h1>\r<p style=\"text-align:center;\"><img style=\"width:300px; height:100px\" src=\"./IMAGES/MAIN/image1.png\" alt=\"Image 1\"/></p>\r<p style=\"text-align:center;\">This is a sentence.</p>"]

You'll notice that it includes [" at the beginning and "] at the end, along with escaped quotes and embedded carriage returns (\r).

PROBLEM: All of the gsub statements work except for the one that tries to replace all carriage returns with blanks.

After running the Ruby script, the string "output" looks as follows, where everything gets substituted properly, except for the carriage returns...

AFTER: <h1 style="text-align:center; font: bold 1.5em Arial;">This is the Title</h1>\r<p style="text-align:center;"><img style="width:300px; height:100px" src="./IMAGES/MAIN/image1.png" alt="Image 1"/></p>\r<p style="text-align:center;">This is a sentence.</p>

For some reason, the carriage returns are NOT being replaced/substituted.

Also, before I upgraded to Ruby 2.0, I used to use FasterCSV and none of the substitution statements were needed. Everything just worked.

Any thoughts as to why this is all happening and how to properly handle it? Any assistance is greatly appreciated.

4 Comments
  • Tip: don't use gsub to try to clean up arbitrary HTML. Use an HTML parser like Nokogiri's HTML Fragments. Commented Jul 3, 2014 at 3:15
  • So, to be clear, you're saying that in your Ruby 2.0 install a string that inspects with \r inside it does not have that removed when you str.gsub(/\r/,'')? Is that the core of your problem, unrelated to CSV? If so, I cannot reproduce those results on ruby 2.0.0p353 on OS X Mavericks. Can you provide any way to reproduce your problem, simply? Commented Jul 3, 2014 at 3:16
  • It would be really helpful if you could supply something like p File.open('my.csv', 'r:utf-8', &:read) so that we could reproduce by recreating your test file exactly. But please pare it down to the smallest file that causes the problems. Commented Jul 3, 2014 at 3:30
  • Hi Phrogz, I added some notes on how to reproduce the CSV from Excel, which might help clarify. Thx. Commented Jul 3, 2014 at 4:11

2 Answers

2

The scope of my answer has changed, so I've edited it down to just the regular expression, as that seems to be more on topic.

I've updated my expression to cover all of your substitutions. Simply replace your loop with this block of code:

arrayOfHtmlConstructs.each_with_index do | construct, i|
  output = "" << construct.to_s
  puts "BEFORE: " << output
  output = output.gsub(/\\"/, "\"") # Replace escaped quotes with quotes WORKS FINE!
  output = output.gsub(/(\\r|\[|\])/, "")
  puts "AFTER: " << output
end
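
For context on why the original /\r/ pattern found nothing: on Ruby 1.9 and later, Array#to_s returns the same text as inspect, so calling to_s on a CSV row yields the two characters backslash and r rather than a real carriage return. On Ruby 1.8, Array#to_s joined the elements instead, which may be why the FasterCSV version needed no substitutions. Below is a minimal sketch of this, assuming a made-up one-cell sample in place of the Excel file (the sample data and variable names are hypothetical). It also shows one alternative, which is to clean the cell string itself instead of the row's to_s output:

```ruby
require 'csv'

# Hypothetical sample: one quoted cell with bare carriage returns between tags.
data = "\"<h1>Title</h1>\r<p>One</p>\r<p>Two</p>\"\n"

rows = CSV.parse(data, row_sep: "\n")
cell = rows[0][0]          # the actual cell string, containing real \r characters

# Array#to_s is inspect on Ruby 1.9+, so \r becomes a backslash followed by "r".
inspected = rows[0].to_s
puts inspected             # ["<h1>Title</h1>\r<p>One</p>\r<p>Two</p>"]

# Working on the cell directly, the plain /\r/ pattern matches as expected.
cleaned = cell.gsub(/\r/, "")
puts cleaned               # <h1>Title</h1><p>One</p><p>Two</p>
```

With this approach the quote and bracket substitutions are not needed either, since those characters only appear in the inspect output and not in the cell.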

4 Comments

Hi Anthony. I feel like it shouldn't be this hard. The other substitutions are all working so I wonder if there's some more accurate regular expression that would correct the substitution. Thx.
Hi @InformationTechnology can you check my latest edit and let me know if that works for you?
Hi @Anthony. This new regular expression works. I was originally trying to find a way to escape the line feed in the regular expression (to get the same effect) but didn't know about the parentheses. What do the parentheses "mean" in the regular expression? Thx.
Hi @InformationTechnology great! I'm glad that worked. Parens are a great way to group & capture within an expression. There's some wonderful reading on it here. I updated my answer so you should be all set - the biggest thing you missed was escaping the \r with a \\r in the expression. In order to grab the leading '\', you needed to escape it.
1

Try this:

@csv = CSV.read(params[:file].path, headers: true, skip_blanks: true, encoding:'windows-1256:utf-8')

You need to specify the Microsoft CSV encoding when reading the file.
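
As a rough illustration of what the encoding option does, here is a self-contained sketch. It assumes the file is in Windows-1252 (the Western European code page), which is an assumption for this example and differs from the windows-1256 value in the line above. The sample bytes and the temporary file are hypothetical. The "external:internal" pair tells CSV to decode the file from the first encoding and hand back strings in the second:

```ruby
require 'csv'
require 'tempfile'

# Hypothetical sample bytes: 0xE9 is "é" in Windows-1252 and is not valid UTF-8 alone.
raw = "name,note\r\ncaf\xE9,ok\r\n".b

table = nil
Tempfile.create(['sample', '.csv']) do |f|
  f.binmode
  f.write(raw)
  f.flush
  # Decode from Windows-1252 (assumed) and transcode to UTF-8 while reading.
  table = CSV.read(f.path, headers: true, skip_blanks: true,
                   encoding: 'windows-1252:utf-8')
end

name = table[0]['name']
puts name            # café
puts name.encoding   # UTF-8
```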

2 Comments

The issue with this solution is that it implies that the file you are reading is encoded for Microsoft and, therefore, Microsoft generated, when it is common to load files that come from many sources, other than Microsoft. When you receive a CSV file from someone like a customer, you don't know whether it was generated from Microsoft software or other tools.
This should cover OSX and Windows as far as I know. We are huge on CSV and haven't had many problems with this solution
