PROBLEM: Ruby 2.0 CSV reader on Mac Mavericks treats Microsoft Excel generated CSV files that have embedded HTML differently. Works fine on Ruby 1.8 with FasterCSV.
I just upgraded my Mac to Mavericks (OS X 10.9.4) and also upgraded Ruby to 2.0.0p451. I used to use Ruby 1.8+ with the FasterCSV gem, but now use Ruby 2.0+ with its native CSV library.
Ruby Version:
ruby -v
ruby 2.0.0p451 (2014-02-24 revision 45167) [universal.x86_64-darwin13]
The CSV file is generated from Office 2011, saved from an original ".xlsx" file.
The following HTML is contained in a single cell of the Microsoft .xlsx file BEFORE it is saved as CSV...
<h1 style="text-align:center; font: bold 1.5em Arial;">This is the Title</h1>
<p style="text-align:center;"><img style="width:300px; height:100px" src="./IMAGES/MAIN/image1.png" alt="Image 1"/></p>
<p style="text-align:center;">This is a sentence.</p>
There are other cells, that also have HTML code embedded.
To reproduce...
- Open an Excel worksheet
- Copy the above HTML into cell A1. Make sure there are Mac carriage returns (control+command+return) between HTML constructs (e.g. between the end of the "h1" construct and the start of a new "p" construct), so that there is a line break between all complete HTML constructs, right in the Excel cell.
- Copy the contents of cell A1 to cell A2, directly below cell A1, to ensure multiple CSV rows (your file will have two formal CSV rows).
- First save the file as an xlsx file (e.g. "file.xlsx")
- Then save the worksheet as a CSV file (e.g. "file.csv").
You will now have an Excel-generated CSV file with two formal CSV rows, where each row contains multiple HTML constructs separated by carriage returns.
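For anyone without Excel at hand, the steps above can be approximated with a short Ruby script. This is only a sketch: the helper name `sample_csv_text` is hypothetical, and it assumes that Excel writes a carriage return both between the HTML constructs inside a cell and at the end of each row, which has not been verified against the real file.

```ruby
# Hypothetical helper: builds CSV text resembling the file described above.
# ASSUMPTION: carriage returns ("\r") separate the HTML constructs inside a
# cell and also terminate each row; Excel's real output may differ.
def sample_csv_text
  constructs = [
    '<h1 style="text-align:center; font: bold 1.5em Arial;">This is the Title</h1>',
    '<p style="text-align:center;"><img style="width:300px; height:100px" src="./IMAGES/MAIN/image1.png" alt="Image 1"/></p>',
    '<p style="text-align:center;">This is a sentence.</p>'
  ]

  # One cell: the three constructs joined by carriage returns.
  cell = constructs.join("\r")

  # CSV quoting: wrap the cell in double quotes and double any embedded quotes.
  quoted = '"' + cell.gsub('"', '""') + '"'

  # Two rows (cells A1 and A2), each followed by a carriage return.
  (quoted + "\r") * 2
end

# Write in binary mode so the carriage returns are stored untouched.
File.open("file.csv", "wb") { |f| f.write(sample_csv_text) }
```

Writing the bytes by hand, rather than through a CSV writer, keeps the line endings fully under control, which matters here because the line endings are the suspected culprit.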
Reading the CSV File...
I use the following code to read the CSV file and print the contents of each cell, both before and after I try to strip control characters...
require 'csv'

arrayOfHtmlConstructs = CSV.read("file.csv")
arrayOfHtmlConstructs.each_with_index do |construct, i|
  # construct is one CSV row (an Array); to_s turns it into a String
  output = "" << construct.to_s
  puts "BEFORE: " << output
  output = output.gsub(/\r/, "")      # Replace Microsoft carriage returns FAILS!
  output = output.gsub(/\\"/, "\"")   # Replace escaped quotes with quotes WORKS FINE!
  output = output.gsub(/\[\"/, "")    # Remove prefix [" WORKS FINE!
  output = output.gsub(/\"\]/, "")    # Remove suffix "] WORKS FINE!
  puts "AFTER: " << output
end
Before trying to strip code, the CSV string "output" looks as follows...
BEFORE: ["<h1 style=\"text-align:center; font: bold 1.5em Arial;\">This is the Title</h1>\r<p style=\"text-align:center;\"><img style=\"width:300px; height:100px\" src=\"./IMAGES/MAIN/image1.png\" alt=\"Image 1\"/></p>\r<p style=\"text-align:center;\">This is a sentence.</p>"]
You'll notice that it includes [" at the beginning and "] at the end, along with escaped quotes and embedded carriage returns (\r).
PROBLEM: All of the gsub statements work except for the one that tries to replace all carriage returns with blanks.
After running the Ruby script, the string "output" looks as follows, where everything gets substituted properly, except for the carriage returns...
AFTER: <h1 style="text-align:center; font: bold 1.5em Arial;">This is the Title</h1>\r<p style="text-align:center;"><img style="width:300px; height:100px" src="./IMAGES/MAIN/image1.png" alt="Image 1"/></p>\r<p style="text-align:center;">This is a sentence.</p>
For some reason, the carriage returns are NOT being replaced/substituted.
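One possible explanation, offered as a sketch rather than a confirmed diagnosis: since Ruby 1.9, `Array#to_s` returns the same text as `Array#inspect`, so calling `to_s` on a row produces the brackets, the escaped quotes, and a visible backslash followed by "r" instead of a real carriage return. The pattern `/\r/` only matches the real character, so it finds nothing. In Ruby 1.8, `Array#to_s` joined the elements instead, which would be consistent with the substitutions not being needed before. The sample text is shortened and the helper `strip_carriage_returns` is a hypothetical name.

```ruby
# One row as a CSV reader would return it: an Array holding a single String
# that contains a real carriage-return character (shortened sample text).
row = ["<h1>This is the Title</h1>\r<p>This is a sentence.</p>"]

# Array#to_s gives the inspect form, so the carriage return is rendered
# as two visible characters: a backslash followed by "r".
inspected = row.to_s
puts inspected.include?("\r")   # is a real carriage return still present?
puts inspected.include?('\r')   # is the literal backslash + r present?

# Hypothetical helper: strip carriage returns from the cell String itself,
# without converting the whole row to its inspect form first.
def strip_carriage_returns(cell)
  cell.gsub(/\r/, "")
end

row.each do |cell|
  puts strip_carriage_returns(cell)
end
```

Working on each cell directly also makes the other three substitutions unnecessary, because the brackets and backslash-escaped quotes only exist in the inspect form of the row.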
Also, before I upgraded to Ruby 2.0, I used to use FasterCSV and none of the substitution statements were needed. Everything just worked.
Any thoughts as to why this is all happening and how to properly handle it? Any assistance is greatly appreciated.
\r inside it does not have that removed when you str.gsub(/\r/,'')? Is that the core of your problem, unrelated to CSV? If so, I cannot reproduce those results on ruby 2.0.0p353 on OS X Mavericks. Can you provide any way to reproduce your problem, simply?
p File.open('my.csv', 'r:utf-8', &:read) so that we could reproduce by recreating your test file exactly. But please pare it down to the smallest file that causes the problems.