
I've got an external text file which looks like this:

This_ART is_P an_ART example_N.
Thus_KONJ this_ART is_P a_ART part_N of_PREP it_N.

Now I want to open this file in Ruby and make an Array with every annotated word. My attempt looks like this:

def get_entries(file)
  return File.open(file).map { |x| x.split(/\W+_[A-Z]+/) }
end

but the execution just returns an Array with each sentence as a member:

[["This_ART is_P an_ART example_N.\n"],["Thus_KONJ this_ART is_P a_ART part_N of_PREP it_N.\n"]]

The punctuation and the escape characters are included. Where is the mistake or what do I have to change to get the correct array?

  • Could you give an example of what you want the matched output to be? Commented May 5, 2011 at 15:33

2 Answers


Try scanning for just the tokens you want, e.g.

return File.read(file).scan(/\w+_[A-Z]+/)

That will give you something like:

["This_ART", "is_P", "an_ART", "example_N", "Thus_KONJ", ...]

If you want the annotation part removed, you could tack on:

.map{ |w| w.gsub(/_[A-Z]+\z/, '') }

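Note that \w matches word characters (letters, digits and the underscore) and \W matches non-word characters, so the pattern in the question needs \w rather than \W.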
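Putting the two steps above together, a minimal sketch could look like the following. The helper name `annotated_words` and the inline `sample` string are assumptions made for illustration; the sample is the text from the question, and in practice it would come from `File.read(file)`.

```ruby
# Sample text copied from the question; normally this would be File.read(file).
sample = "This_ART is_P an_ART example_N.\nThus_KONJ this_ART is_P a_ART part_N of_PREP it_N.\n"

# Hypothetical helper: collects every word_TAG token in the text.
def annotated_words(text)
  text.scan(/\w+_[A-Z]+/)
end

tokens = annotated_words(sample)

# Optional second step: drop the trailing _TAG to keep only the bare words.
words = tokens.map { |w| w.sub(/_[A-Z]+\z/, '') }

p tokens
p words
```

Because `scan` only collects what matches, the punctuation and newlines never reach the result, so no separate cleanup step is needed.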


1 Comment

Yes, this did it. I actually wasn't precise enough, but I meant the first solution. Thank you!

/\W+_[A-Z]+/

matches only if there is a non-word character before the _, which isn't the case in your string.

I don't know exactly what you're expecting as a result, but try this:

/_[A-Z]+\W*/

Splitting the whole text (e.g. the result of File.read(file)) along this regex gives you

["This", "is", "an", "example", "Thus", "this", "is", "a", "part", "of", "it"]
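As a rough sketch of the split approach, the snippet below applies the regex both to the whole text and line by line. The inline `sample` string stands in for the file contents and is an assumption for illustration.

```ruby
# Sample text copied from the question; normally this would be File.read(file).
sample = "This_ART is_P an_ART example_N.\nThus_KONJ this_ART is_P a_ART part_N of_PREP it_N.\n"

# Splitting the whole text yields one flat array of bare words.
flat = sample.split(/_[A-Z]+\W*/)

# Splitting line by line (as the map in the question does) yields
# one array of words per sentence instead.
per_line = sample.each_line.map { |line| line.split(/_[A-Z]+\W*/) }

p flat
p per_line
```

The separator pattern consumes each tag together with any whitespace or punctuation that follows it, and `split` drops trailing empty strings by default, so no empty entries are left at the end.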
