1

I'm trying to parse a file containing a name followed by a hierarchy path. I want to take the named regex matches, turn them into Hash keys, and store the match as a hash. Each hash will get pushed to an array (so I'll end up with an array of hashes after parsing the entire file. This part of the code is working except now I need to handle bad paths with duplicated hierarchy (top_* is always the top level). It appears that if I'm using named backreferences in Ruby I need to name all of the backreferences. I have gotten the match working in Rubular but now I have the p1 backreference in my resultant hash.

Question: What's the easiest way to not include the p1 key/value pair in the hash? My method is used in other places so we can't assume that p1 always exists. Am I stuck with dropping each key/value pair in the array after calling the s_ary_to_hash method?

NOTE: I'm keeping this question to try and solve the specific issue of ignoring certain hash keys in my method. The regex issue is now in this ticket: Ruby regex - using optional named backreferences

UPDATE: Regex issue is solved, the hier is now always stored in the named 'hier' group. The only item remaining is to figure out how to drop the 'p1' key/value if it exists prior to creating the Hash.

Example file:

name1 top_cat/mouse/dog/top_cat/mouse/dog/elephant/horse
new12 top_ab12/hat[1]/top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool
tops  top_bat/car[0]
ab123 top_2/top_1/top_3/top_4/top_2/top_1/top_3/top_4/dog

Expected output:

[{:name => "name1", :hier => "top_cat/mouse/dog/elephant/horse"},
 {:name => "new12", :hier => "top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool"},
 {:name => "tops",  :hier => "top_bat/car[0]"},
 {:name => "ab123", :hier => "top_2/top_1/top_3/top_4/dog"}]

Code snippet:

def s_ary_to_hash(ary, regex)
  retary = Array.new
  ary.each {|x| (retary << Hash[regex.match(x).names.map{|key| key.to_sym}.zip(regex.match(x).captures)]) if regex.match(x)}
  return retary
end

regex = %r{(?<name>\w+) (?<p1>[\w\/\[\]]+)?(?<hier>(\k<p1>.*)|((?<= ).*$))}
h_ary = s_ary_to_hash(File.readlines(filename), regex)
7
  • 1
    Do you have a .html/.xml file? If so , please use nokogiri. Commented Jan 30, 2014 at 19:31
  • Agreed, but this is not HTML or XML... it's a dump from another program that I can't touch. Commented Jan 30, 2014 at 19:37
  • Greg, regardless of the regex you use, consider replacing the three lines of s_ary_to_hash with ary.each_with_object([]) { |x, retry| .... }. Commented Jan 30, 2014 at 20:38
  • Good tip, I had forgotten about the |x, retry| way of collecting results. Thanks! Commented Jan 30, 2014 at 21:02
  • @Greg, when I test your code, I don't get the last line of your expected output. Could you check that? Commented Jan 30, 2014 at 22:34

3 Answers 3

2

What about this regex ?

^(?<name>\S+)\s+(?<p1>top_.+?)(?:\/(?<hier>\k<p1>(?:\[.+?\])?.+))?$

Demo

http://rubular.com/r/awEP9Mz1kB

Sample code

def s_ary_to_hash(ary, regex, mappings)
   retary = Array.new

   for item in ary
      tmp = regex.match(item)
      if tmp then
         hash = Hash.new
         retary.push(hash)
         mappings.each { |mapping|
            mapping.map { |key, groups|
              for group in group
                 if tmp[group] then
                     hash[key] = tmp[group]
                     break
                 end
              end 
            }
         }
      end
   end

  return retary
end

regex = %r{^(?<name>\S+)\s+(?<p1>top_.+?)(?:\/(?<hier>\k<p1>(?:\[.+?\])?.+))?$}
h_ary = s_ary_to_hash(
   File.readlines(filename), 
   regex,
   [ 
      {:name => ['name']},
      {:hier => ['hier','p1']}
   ]
)

puts h_ary

Output

{:name=>"name1", :hier=>"top_cat/mouse/dog/elephant/horse\r"}
{:name=>"new12", :hier=>"top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool\r"}
{:name=>"tops", :hier=>"top_bat/car[0]"}

Discussion

Since Ruby 2.0.0 doesn't support branch reset, I have built a solution that add some more power to the s_ary_to_hash function. It now admits a third parameter indicating how to build the final array of hashes.

This third parameter is an array of hashes. Each hash in this array has one key (K) corresponding to the key in the final array of hashes. K is associated with an array containing the named group to use from the passed regex (second parameter of s_ary_to_hash function).

If a group equals nil, s_ary_to_hash skips it for the next group.

If all groups equal nil, K is not pushed on the final array of hashes. Feel free to modify s_ary_to_hash if this isn't a desired behavior.

Sign up to request clarification or add additional context in comments.

6 Comments

Alex, when I run this I get an array of three hashes. The first two are the same as in "Expected output". The third, however, is {:name=>"tops", :hier=>"top_bat/car[0]"}]. Greg, is the "Expected output" correct?
@CarySwoveland do you need to drop top_ in the third hash ?
Oops, I had a type in my expected output. It's corrected now.
Alex, that regex doesn't meet the criteria of dropping the duplicated hierarchy.
Hi, :p1 still appears in my Hash (tested in irb). My regex functionality is not the problem, it's that it's storing the p1 key/value pair in my hash. Original question --> What's the easiest way to not include the p1 key/value pair in the hash
|
0

Edit: I've changed the method s_ary_to_hash to conform with what I now understand to be the criterion for excluding directories, namely, directory d is to be excluded if there is a downstream directory with the same name, or the same name followed by a non-negative integer in brackets. I've applied that to all directories, though I made have misunderstood the question; perhaps it should apply to the first.

data =<<THE_END
name1 top_cat/mouse/dog/top_cat/mouse/dog/elephant/horse
new12 top_ab12/hat/top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool
tops  top_bat/car[0]
ab123 top_2/top_1/top_3/top_4/top_2/top_1/top_3/top_4/dog
THE_END

text = data.split("\n")

def s_ary_to_hash(ary)
  ary.map do |s| 
    name, _, downstream_path = s.partition(' ').map(&:strip)
    arr = []
    downstream_dirs = downstream_path.split('/')
    downstream_dirs.each {|d| puts "'#{d}'"}
    while downstream_dirs.any? do
      dir = downstream_dirs.shift
      arr << dir unless downstream_dirs.any? { |d|
        d == dir || d =~ /#{dir}\[\d+\]/ }
    end     
    { name: name, hier: arr.join('/') }
  end   
end

s_ary_to_hash(text)
  # => [{:name=>"name1", :hier=>"top_cat/mouse/dog/elephant/horse"},
  #     {:name=>"new12", :hier=>"top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool"},
  #     {:name=>"tops", :hier=>"top_bat/car[0]"},
  #     {:name=>"ab123", :hier=>"top_2/top_1/top_3/top_4/dog"}] 

The exclusion criterion is implement in downstream_dirs.any? { |d| d == dir || d =~ /#{dir}\[\d+\]/ }, where dir is the directory that is being tested and downstream_dirs is an array of all the downstream directories. (When dir is the last directory, downstream_dirs is empty.) Localizing it in this way makes it easy to test and change the exclusion criterion. You could shorten this to a single regex and/or make it a method:

dir exclude_dir?(dir, downstream_dirs)
  downstream_dirs.any? { |d| d == dir || d =~ /#{dir}\[\d+\]/ }end
end

1 Comment

Hi Cary. I want to take the 1st element in the 2nd string (elements delineated by slashes) and see if it exists anywhere in the remaining string. If it does, I want my returned match to contain everything from that 2nd location onwards. If it does not, I want my match to contain the entire original 2nd string.
0

Here is a non regexp solution:

result = string.each_line.map do |line|
  name, path = line.split(' ')
  path = path.split('/')
  last_occur_of_root = path.rindex(path.first)
  path = path[last_occur_of_root..-1]
  {name: name, heir: path.join('/')}
end

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.