3

I have this line as an example from a CSV file:

2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"

I want to split it into an array. The immediate thought is to just split on commas, but some of the strings have commas in them, e.g. "Life and Living Processes, Life Processes", and these should stay as single elements in the array. Note also that there are two pairs of adjacent commas with nothing in between; I want to get these as empty strings.

In other words, the array I want to get is

[2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes","","",1,0,"endofline"]

I can think of hacky ways involving eval, but I'm hoping someone can come up with a clean regex to do it...

cheers, max

    This is a perfect example of how not everything involved in extracting data from a string is a job for regexes. Commented Oct 14, 2010 at 14:53

6 Answers

9

This is not a suitable task for regular expressions. You need a CSV parser, and Ruby has one built in:

http://ruby-doc.org/stdlib/libdoc/csv/rdoc/classes/CSV.html

And an arguably superior third-party library:

http://fastercsv.rubyforge.org/

4 Comments

I thought CSV could not cope with qualifiers?
FasterCSV is the default CSV library in Ruby 1.9.x. It lets you specify a quote_char, which might help in this case.
What are "qualifiers"? This is a stock CSV line. No need to mess with quote_chars.
I agree that use of CSV methods is preferred, but that's not to say it can't be done with a regex.
4
require 'csv' # built in

str = <<EOF
2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"
EOF

p CSV.parse(str)
# That's it! However, empty fields appear as nil.
# Makes sense to me, but if you insist on empty strings, do something like:
parser = CSV.new(str)
parser.convert { |field| field.nil? ? "" : field }
p parser.readlines
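
For a single line, a variant that gets closer to the array shown in the question (numbers as integers, empty fields as empty strings) might look like the sketch below. It assumes the standard library's CSV.parse_line with the built-in :numeric converter; the variable names line and fields are only illustrative.

```ruby
require 'csv'

line = '2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"'

# parse_line returns one flat array for a single CSV line.
# converters: :numeric turns "2412" into 2412 and leaves non-numeric strings alone.
fields = CSV.parse_line(line, converters: :numeric)

# Empty fields come back as nil; swap them for empty strings.
fields = fields.map { |field| field.nil? ? "" : field }

p fields
```

This post-processes the parsed row rather than registering a converter, so it does not depend on how converters treat nil fields.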

1 Comment

Thanks Steenslag, this is perfect. As it happens i don't mind the empty fields coming through as nil. Cheers, max
2

EDIT: I failed to read the Ruby tag. The good news is, the guide will explain the theory behind building this, even if the language specifics aren't right. Sorry.

Here is a fantastic guide to doing this:

http://knab.ws/blog/index.php?/archives/10-CSV-file-parser-and-writer-in-C-Part-2.html

and the csv writer is here:

http://knab.ws/blog/index.php?/archives/3-CSV-file-parser-and-writer-in-C-Part-1.html

These examples cover the case of having a quoted literal in a csv (which may or may not contain a comma).

2
text=<<EOF
2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"
EOF
x=[]
text.chomp.split("\042").each_with_index do |y,i|
  i%2==0 ?  x<< y.split(",") : x<<y
end
print x.flatten

output

$ ruby test.rb
["2412", "21", "Which of the following is not found in all cells?", "Curriculum", "Life and Living Processes, Life Processes", "", "", "", "1", "0", "endofline"]
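
Note that this output contains three empty strings, while the input line has only two empty fields: the comma that follows a closing quote is treated as the start of another field. A rough check against the standard library's CSV.parse_line (used here only as a reference, not part of the approach above) could look like this; the names pieces and reference are illustrative.

```ruby
require 'csv'

text = '2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"'

# Same idea as above: split on double quotes, then split the
# even-indexed (unquoted) chunks on commas.
pieces = []
text.split('"').each_with_index do |chunk, i|
  if i.even?
    pieces.concat(chunk.split(","))
  else
    pieces << chunk
  end
end

# Reference result from the CSV library.
reference = CSV.parse_line(text)

puts pieces.length     # 11 elements, one spurious empty string
puts reference.length  # 10 fields
```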

1

This morning I stumbled across a CSV Table Importer project for Ruby on Rails. You may find the code helpful:

GitHub TableImporter

0

My preference is @steenslag's solution, but an alternative is to use String#scan with the following regular expression.

r = /(?<![^,])(?:(?!")[^,\n]*(?<!")|"[^"\n]*")(?![^,])/

If the variable str holds the string given in the example, we obtain:

puts str.scan r

displays

2412
21
"Which of the following is not found in all cells?"
"Curriculum"
"Life and Living Processes, Life Processes"


1
0
"endofline"

See also regex101 which provides a detailed explanation of each token of the regex. (Move your cursor across the regex.)

Ruby's regex engine performs the following operations.

(?<![^,]) : negative lookbehind asserts current location is not preceded
            by a character other than a comma
(?:       : begin non-capture group
  (?!")   : negative lookahead asserts next char is not a double-quote
  [^,\n]* : match 0+ chars other than a comma and newline
  (?<!")  : negative lookbehind asserts preceding character is not a
            double-quote
  |       : or
  "       : match double-quote
  [^"\n]* : match 0+ chars other than double-quote and newline
  "       : match double-quote
)         : end of non-capture group
(?![^,])  : negative lookahead asserts current location is not followed
            by a character other than a comma

Note that (?<![^,]) is equivalent to (?<=,|\A) and (?![^,]) is equivalent to (?=,|\z).
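
The matches returned by scan keep the double quotes around quoted fields. If plain strings are wanted, one possible follow-up step is to strip a leading and trailing quote from each match. This is only a sketch; the names raw and fields are illustrative, and str is assumed to hold the example line without a trailing newline.

```ruby
r = /(?<![^,])(?:(?!")[^,\n]*(?<!")|"[^"\n]*")(?![^,])/

str = '2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"'

# The regex has no capture groups, so scan returns an array of whole matches.
raw = str.scan(r)

# Remove one surrounding pair of double quotes where present;
# unquoted and empty fields are left unchanged.
fields = raw.map { |f| f.sub(/\A"(.*)"\z/m, '\1') }

p fields
```

Unlike the CSV library, this leaves every field as a string and does not handle escaped quotes ("") inside quoted fields.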
