0

I'm just learning Ruby and have been tackling small code projects to accelerate the process.

What I'm trying to do here is read only the alphabetic words from a text file into an array, then delete the words that are shorter than 5 characters. The puts at the bottom stands in for where I intend to use the array. My code currently works, but it is very slow, since it has to read the entire file, then check each element individually and delete the ones that don't qualify. This seems like it's doing too much work.

goal = File.read('big.txt').split(/\s/).map do |word|
    word.scan(/[[:alpha:]]+/).uniq
end

goal.each { |word|
    if word.length < 5
        goal.delete(word)
    end 
}

puts goal.sample

Is there a way to apply the criteria to my File.read block to keep it from mapping the short words to begin with? I'm open to anything that would help me speed this up.

3
  • Instead of storing everything and deleting what shouldn't be there, don't save it in the first place. This will be a huge improvement. Commented Nov 20, 2014 at 17:43
  • I answered below, but I have a remark on your style: you use both do..end (for map) and {..} (for each) as ways of passing in a block. In general, always use do..end unless the block is a single line (like {|word| word.upcase!}) Commented Nov 20, 2014 at 18:23
  • Changing the regexp to only match words bigger than 5 letters makes the code run ~35x faster in the testcase I did so... yeah :D Commented Nov 20, 2014 at 18:25

3 Answers

3

You might want to change your regex so that it only captures words of at least 5 characters to begin with:

goal = File.read('C:\Users\bkuhar\Documents\php\big.txt').split(/\s/).flat_map do |word|
  word.scan(/[[:alpha:]]{5,}/).uniq
end

A further optimization might be to maintain a Set instead of an Array, to avoid re-scanning for uniqueness:

require 'set'

goal = Set.new
File.read('C:\Users\bkuhar\Documents\php\big.txt').scan(/\b[[:alpha:]]{5,}\b/).each do |w|
  goal << w
end
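For what it's worth, here is a self-contained sketch of the same idea that can be tried without the file. The helper name long_words and its min_length parameter are made up for illustration, and it assumes the whole text fits in memory as one string:

```ruby
require 'set'

# Hypothetical helper: returns the unique alphabetic words in +text+
# that are at least +min_length+ characters long.
def long_words(text, min_length: 5)
  # The \b anchors keep a match from starting or ending in the middle of a word.
  pattern = /\b[[:alpha:]]{#{min_length},}\b/
  text.scan(pattern).to_set
end

sample = "It was a bright cold day, and the clocks were striking; bright indeed."
words = long_words(sample)
puts words.to_a.sample
```

With the real file this would be called as long_words(File.read('big.txt')). Set has no sample method, hence the to_a before sampling.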

3 Comments

The first example you provided works great, but instead of reading in words longer than 5 characters and ignoring shorter ones, it's reading shorter ones in as empty arrays into the 2D array goal.
I've changed map to flat_map. I've taken the map from your code, which will also produce a 2D array...
@Cary, thanks for your review! The first version is, again, me trying to show the solution with the fewest changes, so it might not be the code I would have written, but it will be the most familiar to the OP.
2

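In this case, use the delete_if method:

# goal is your array of words
goal.delete_if { |w| w.length < 5 }

This removes the words shorter than 5 characters from goal in place and returns that same array. If you want a new array and the original left untouched, use reject instead.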
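A small sketch of the difference between the two methods, using a made-up word list: delete_if changes the receiver itself, while reject leaves it alone and hands back a new array.

```ruby
words = %w[holmes the watson a street]

# reject does not touch the receiver; it returns a new, filtered array.
kept = words.reject { |w| w.length < 5 }

# delete_if removes the elements from the receiver and returns the receiver.
result = words.delete_if { |w| w.length < 5 }

p kept                   # the filtered copy
p result.equal?(words)   # true: same object, now with the short words gone
```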

Hope this helps.

1 Comment

...it won't improve the running time of the code... "My code currently works, but is very very slow since it has to read the entire file, then individually check each element and delete the appropriate ones"... I agree that using delete_if is the way to write the code above in an idiomatic way, but it does not answer the question...
1

I really don't understand what a lot of the code in your first loop is for.

You take every chunk of text separated by whitespace, scan it for runs of letters, deduplicate those runs, and collect each resulting array into an outer array.

This is way too complicated for what you want. Try this:

# chomp: true strips the trailing newline so it does not count toward the length
goal = File.readlines('big.txt', chomp: true).select do |word|
  word =~ /\A[a-zA-Z]+\z/ &&
  word.length >= 5
end

This makes it easy to add new conditions, too. If the word can't contain 'q' or 'Q', for example:

# chomp: true strips the trailing newline so it does not count toward the length
goal = File.readlines('big.txt', chomp: true).select do |word|
  word =~ /\A[a-zA-Z]+\z/ &&
  word.length >= 5 &&
  !word.upcase.include?('Q')
end

This assumes that each word in your dictionary is on its own line. You could go back to splitting on whitespace, but that makes me wonder whether the file you are reading is ordinary human-readable text, that is, whether it has 'words' ending in periods or commas, like this sentence. In that case, splitting on whitespace alone will not work, because the punctuation stays attached to the words.
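To illustrate the punctuation problem with a made-up sentence: splitting on whitespace keeps the commas and periods glued to the words, whereas scanning for runs of letters leaves them out.

```ruby
sentence = "Elementary, my dear Watson. Quite so!"

# Splitting on whitespace keeps punctuation attached to the tokens.
by_space = sentence.split(/\s+/)

# Scanning for runs of letters drops the punctuation entirely.
by_scan = sentence.scan(/[[:alpha:]]+/)

p by_space
p by_scan
```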

Another note - map is the wrong array method to use here. It builds a new array by transforming every value of the original one. You want to pick certain values out of an array without changing them, and Array#select is the method for that.
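A quick side-by-side on a made-up list, in case it helps: map always gives back one result per element, select gives back a subset of the original elements.

```ruby
words = %w[baker street the hound of]

# map transforms every element, so the result has as many entries as the input.
lengths = words.map { |w| w.length }

# select keeps only the elements for which the block is truthy, unchanged.
long = words.select { |w| w.length >= 5 }

p lengths
p long
```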

Also, feel free to change the regex back to the [[:alpha:]] character class if you are expecting letters outside a-z, such as accented characters.


Edit: Second version

goal = File.read('big.txt').scan(/[a-z][a-z']{4,}/i)

Explanation: Read the whole file into one string and find every run of letters that is at least 5 characters long, possibly containing but not starting with a '. Unlike match, which stops at the first occurrence, String#scan returns all of the occurrences as an array.

This works well, and it's only one line for your whole task, but it'll match

sugar'

in

I'd like some 'sugar', if you know what I mean

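Like above, if your word can't contain q or Q, you could leave q out of the character classes and anchor the match at word boundaries, so that it cannot start in the middle of a word such as question:

/\b[a-pr-z][a-pr-z']{4,}\b/i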

And an idea - call reject on goal to drop all the entries that end with a '. This works around the limitation of my regex.
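A rough sketch of that cleanup on a made-up sentence. Dropping the entries follows the idea above; trimming the trailing ' instead is an alternative I'm adding here, not something the original approach requires.

```ruby
text = "I'd like some 'sugar', if you wouldn't mind passing it"

# Same pattern as above: a letter followed by 4 or more letters or apostrophes.
goal = text.scan(/[a-z][a-z']{4,}/i)

# Option 1: drop every entry that ends with an apostrophe.
dropped = goal.reject { |w| w.end_with?("'") }

# Option 2 (an alternative): keep the word but strip the trailing apostrophes.
trimmed = goal.map { |w| w.sub(/'+\z/, '') }

p goal
p dropped
p trimmed
```

Note that trimming can leave a word shorter than 5 characters, so a final select on length may be needed after option 2.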

2 Comments

This does work great for a file with a word per line. The file I'm using is an excerpt from a Sherlock Holmes novel, which is why I was using white space to delimit the words. Is there a way to apply this approach to block text, because I agree that it's very likely that I'm vastly over complicating it by making an array of a thousand little arrays?
You start getting into complicated regex when you consider tougher word cases - is "doesn" a word if "doesn't" appears in the text? If our regex is simple and we capture that ', then we might also capture the ' at the beginning of a quote, making 'The a word. Give me a little bit to work on something.
