I'm trying to split the string:

"[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"

into the following array:

[
  ["test", "blah"],
  ["foo", "bar bar bar"],
  ["test", "abc", "123", "456 789"]
]

I tried the following, but it isn't quite right:

"[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
.scan(/\[(.*?)\s*\|\s*(.*?)\]/)
# =>
# [
#   ["test", "blah"],
#   ["foo", "bar bar bar"],
#   ["test", "abc |123 | 456 789"]
# ]

I need to split at every pipe instead of the first pipe. What would be the correct regular expression to achieve this?

4 Answers

 s = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
 arr = s.scan(/\[(.*?)\]/).map {|m| m[0].split(/ *\| */)}
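The block indexes `m[0]` because, when the pattern contains a capture group, `scan` returns one array of captures per match rather than the matched string itself. A minimal sketch of the intermediate values (the names `captures` and `rows` are mine, not part of the answer):

```ruby
s = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"

# With one capture group, scan yields a one-element array per match.
captures = s.scan(/\[(.*?)\]/)
# captures => [["test| blah"], ["foo |bar bar bar"], ["test| abc |123 | 456 789"]]

# m[0] is the text between the brackets; split it on a pipe plus optional spaces.
rows = captures.map { |m| m[0].split(/ *\| */) }
# rows => [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]
```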

2 Comments

This is the best answer. It uses scan and split in the right place.
All the answers are great but this looks like the simplest solution.
Two alternatives:

s = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"

s.split(/\s*\n\s*/).map{ |p| p.scan(/[^|\[\]]+/).map(&:strip) }
#=> [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]

s.split(/\s*\n\s*/).map do |line|
  line.sub(/^\s*\[\s*/,'').sub(/\s*\]\s*$/,'').split(/\s*\|\s*/)
end
#=> [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]

Both of them start by splitting on newlines (throwing away surrounding whitespace).

The first one then splits each chunk by looking for anything that is not a [, |, or ] and then throws away extra whitespace (calling strip on each).

The second one then throws away leading [ and trailing ] (with whitespace) and then splits on | (with whitespace).


You cannot get the final result you want with a single scan. About the closest you can get is this:

s.scan(/\[(?:([^|\]]+)\|)*([^|\]]+)\]/)
#=> [["test", " blah"], ["foo ", "bar bar bar"], ["123 ", " 456 789"]]

…which drops information, or this:

s.scan(/\[((?:[^|\]]+\|)*[^|\]]+)\]/)
#=> [["test| blah"], ["foo |bar bar bar"], ["test| abc |123 | 456 789"]]

…which captures the contents of each "array" as a single capture, or this:

s.scan(/\[(?:([^|\]]+)\|)?(?:([^|\]]+)\|)?(?:([^|\]]+)\|)?([^|\]]+)\]/)
#=> [["test", nil, nil, " blah"], ["foo ", nil, nil, "bar bar bar"], ["test", " abc ", "123 ", " 456 789"]]

…which is hardcoded to a maximum of four items, and inserts nil entries that you would need to .compact away.
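For completeness, a sketch of what that cleanup could look like, assuming you accept the four-item ceiling. The names `pattern` and `rows` are illustrative only:

```ruby
s = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"

# Up to three optional "item|" groups followed by one mandatory final item.
pattern = /\[(?:([^|\]]+)\|)?(?:([^|\]]+)\|)?(?:([^|\]]+)\|)?([^|\]]+)\]/

# compact drops the nil captures; strip trims the whitespace the pattern keeps.
rows = s.scan(pattern).map { |captures| captures.compact.map(&:strip) }
# rows => [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]
```

A bracketed group with five or more items would not match this pattern at all, so this is not a general solution.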

There is no way to use Ruby's scan to take a regex like /(?:(aaa)b)+/ and get multiple captures for each time the repetition is matched.
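A small demonstration of that limitation. I've swapped in `/(?:(\w)-)+/` for the pattern above so the iterations are distinguishable; the variable names are mine:

```ruby
# A capture group inside a repetition is overwritten on every pass,
# so only the text from the final iteration survives.
m = "a-b-c-".match(/(?:(\w)-)+/)
whole    = m[0]        # => "a-b-c-" (the repetition matched three times)
captured = m.captures  # => ["c"]    (one capture, not three)

# scan behaves the same way: one match, one surviving capture.
scanned = "a-b-c-".scan(/(?:(\w)-)+/)  # => [["c"]]
```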


Why the hard path (single regex)? Why not a simple combo of splits? Here are the steps, to visualize the process.

str = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"

arr = str.split("\n").map(&:strip) # => ["[test| blah]", "[foo |bar bar bar]", "[test| abc |123 | 456 789]"]
arr = arr.map{|s| s[1..-2] } # => ["test| blah", "foo |bar bar bar", "test| abc |123 | 456 789"]
arr = arr.map{|s| s.split('|').map(&:strip)} # => [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]

This is likely far less efficient than scan, but at least it's simple :)
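The efficiency remark is a guess on my part. If it matters, one way to check is Ruby's standard `Benchmark` module. The sketch below is only a harness: the lambda names, the scan-based variant, and the iteration count `n` are arbitrary choices for illustration, and timings will vary by machine and Ruby version.

```ruby
require 'benchmark'

str = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"

# The split-based steps from above, chained into one lambda.
by_split = -> { str.split("\n").map(&:strip).map { |s| s[1..-2].split('|').map(&:strip) } }
# A scan-based equivalent, for comparison.
by_scan  = -> { str.scan(/\[(.*?)\]/).map { |m| m[0].split(/\s*\|\s*/) } }

n = 10_000 # arbitrary iteration count
Benchmark.bm(6) do |x|
  x.report('split') { n.times { by_split.call } }
  x.report('scan')  { n.times { by_scan.call } }
end
```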


A "Scan, Split, Strip, and Delete" Train-Wreck

The whole premise seems flawed, since it assumes that you will always find alternation in your sub-arrays and that expressions won't contain character classes. Still, if that's the problem you really want to solve for, then this should do it.

First, str.scan( /\[.*?\]/ ) returns three strings, each one a bracketed pseudo-array. Then you map over those strings, splitting each on the alternation character (the pipe). Finally, each resulting element is stripped of whitespace and has its square brackets deleted. For example:

str = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
str.scan( /\[.*?\]/ ).map { |arr| arr.split('|').map { |m| m.strip.delete '[]' }}

#=> [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]
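One caveat, assuming the input may contain spaces just inside the brackets (the sample string in the question does not): because strip runs before delete, whitespace next to a bracket survives. Deleting first avoids that. A sketch with a hypothetical input `padded`:

```ruby
padded = "[ test | blah ]"

# strip first: the space next to each bracket is not at the string's edge yet.
strip_first = padded.scan(/\[.*?\]/).map { |arr| arr.split('|').map { |m| m.strip.delete('[]') } }
#=> [[" test", "blah "]]

# delete first, then strip: the leftover whitespace is trimmed.
delete_first = padded.scan(/\[.*?\]/).map { |arr| arr.split('|').map { |m| m.delete('[]').strip } }
#=> [["test", "blah"]]
```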

Verbosely, Step-by-Step

Mapping nested arrays is not always intuitive, so I've unwound the train-wreck above into more procedural code for comparison. The results are identical, but the following may be easier to reason about.

string = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
array_of_strings = string.scan( /\[.*?\]/ )
#=> ["[test| blah]", "[foo |bar bar bar]", "[test| abc |123 | 456 789]"]

sub_arrays = array_of_strings.map { |sub_array| sub_array.split('|') }
#=> [["[test", " blah]"],
#    ["[foo ", "bar bar bar]"],
#    ["[test", " abc ", "123 ", " 456 789]"]]

stripped_sub_arrays = sub_arrays.map { |sub_array| sub_array.map(&:strip) }
#=> [["[test", "blah]"],
#    ["[foo", "bar bar bar]"],
#    ["[test", "abc", "123", "456 789]"]]

sub_arrays_without_brackets =
  stripped_sub_arrays.map { |sub_array| sub_array.map {|elem| elem.delete '[]'} }
#=> [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]
