Regular expressions in Ruby

Question

I've got an external text file which looks like this:

This_ART is_P an_ART example_N.
Thus_KONJ this_ART is_P a_ART part_N of_PREP it_N.

Now I want to open this file in Ruby and make an Array with every annotated word. My attempt looks like this:

def get_entries(file)
  return File.open(file).map { |x| x.split(/\W+_[A-Z]+/) }
end

but the execution just returns an Array with each sentence as a member:

[["This_ART is_P an_ART example_N.\n"],["Thus_KONJ this_ART is_P a_ART part_N of_PREP it_N.\n"]]

The punctuation and the escape characters are included. Where is the mistake or what do I have to change to get the correct array?

Could you give an example of what you want the matched output to be?
– kafuchau, Commented May 5, 2011 at 15:33

Jon Jensen · Accepted Answer · 2011-05-05 15:40:55Z

1

instead of splitting, try scanning for just the annotated words you want, e.g.

return File.read(file).scan(/\w+_[A-Z]+/)

that will give you something like:

["This_ART", "is_P", "an_ART", "example_N", "Thus_KONJ", ...]

if you want the annotation part removed, you could tack on:

.map{ |w| w.gsub(/_[A-Z]+\z/, '') }

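note that \w matches word characters (letters, digits and underscore) and \W matches non-word characters, so the \W+ in your pattern cannot match the letters in front of the _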
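For reference, a minimal self-contained sketch of the scan approach. The sample text is inlined as a string rather than read from a file, and the helper names annotated_words and bare_words are made up for this illustration:

```ruby
# Sample input inlined for illustration; in the question's setting this
# would come from File.read(file).
TEXT = "This_ART is_P an_ART example_N.\n" \
       "Thus_KONJ this_ART is_P a_ART part_N of_PREP it_N.\n"

# Hypothetical helper: collect every word_TAG token in the text.
def annotated_words(text)
  text.scan(/\w+_[A-Z]+/)
end

# Hypothetical helper: the same tokens with the trailing _TAG removed.
def bare_words(text)
  annotated_words(text).map { |w| w.sub(/_[A-Z]+\z/, '') }
end

p annotated_words(TEXT).first(4)  # => ["This_ART", "is_P", "an_ART", "example_N"]
p bare_words(TEXT).first(4)       # => ["This", "is", "an", "example"]
```

Because scan walks the whole string, the punctuation and newlines between tokens are simply skipped, and the result is one flat array.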

answered May 5, 2011 at 15:32

Jon Jensen


1 Comment

Yes, this did it. I actually wasn't precise enough, but I meant the first solution. Thank you!

Tim Pietzcker · Accepted Answer · 2011-05-05 15:32:59Z

0

/\W+_[A-Z]+/

matches only if there is a non-word character before the _, which isn't the case in your string.
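A quick way to check this, as a small sketch on one line of the sample input: the original pattern finds nothing to split on, so split hands back the whole line as a single element.

```ruby
line = "This_ART is_P an_ART example_N.\n"

# \W+ needs at least one non-word character immediately before the "_",
# but every "_" here is preceded by a letter, so there is no match.
p line.match(/\W+_[A-Z]+/)   # => nil

# With no match to split on, the line comes back unchanged in a 1-element array.
p line.split(/\W+_[A-Z]+/)   # => ["This_ART is_P an_ART example_N.\n"]
```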

I don't know exactly what you're expecting as a result, but try this:

/_[A-Z]+\W*/

Splitting the whole file contents (e.g. from File.read(file)) on this regex gives you

["This", "is", "an", "example", "Thus", "this", "is", "a", "part", "of", "it"]
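As a rough sketch of how that split behaves, with the sample text inlined as a string and a helper name, untagged_words, that is only an illustration:

```ruby
# Hypothetical helper: split on each _TAG plus any non-word characters
# (spaces, punctuation, newlines) that follow it.
def untagged_words(text)
  text.split(/_[A-Z]+\W*/)
end

text = "This_ART is_P an_ART example_N.\n" \
       "Thus_KONJ this_ART is_P a_ART part_N of_PREP it_N.\n"

# Splitting the whole text at once gives one flat array.
p untagged_words(text)
# => ["This", "is", "an", "example", "Thus", "this", "is", "a", "part", "of", "it"]

# Splitting line by line instead gives one array per sentence.
p text.each_line.map { |line| untagged_words(line) }
# => [["This", "is", "an", "example"], ["Thus", "this", "is", "a", "part", "of", "it"]]
```

Since \W* also consumes the ".\n" at the end of each sentence, no punctuation or newline characters are left in the result.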

answered May 5, 2011 at 15:32

Tim Pietzcker
