2

So, I've got a word analyzing program in Excel with which I hope to be able to import over 30 million words.

At first,I created a separate object for each of these words so that each word has a...

.value '(string), the actual word itself
.bool1 '(boolean) 
.bool2 '(boolean)
.bool3 '(boolean)
.isUsed '(boolean)
.cancel '(boolean)

When I found out I may have 30 million of these objects (all stored in a single collection), I thought that this could be a monster to compile. And so I decided that all my words would be strings, and that I would stick them into an array.

So my array idea is to append each of the 30 million strings by adding 5 spaces (for my 5 bools) at the beginning of each string, with each empty space representing a false bool val. e.g,

If instr(3, arr(n), " ") = 1 then   
'my 3rd bool val is false. 
Elseif instr(3, arr(n), "*") = 1 then '(I'll insert a '*' to denote true) 
'my third bool val is true.
End If

Anyway, what do you guys think? Which way (collection or array) should I go about this (for optimization specifically)?

3
  • I suggest you use something other than Excel Commented Sep 8, 2015 at 1:19
  • 1
    Ahh, I'm pretty new to programming in general, but this can't be done in Excel? Honestly I would be okay with leaving my computer on for a few days if I needed to in order to complete the operation. I never really thought about it but perhaps Excel is not the best fit. Commented Sep 8, 2015 at 1:32
  • 2
    Great question and approach! Commented Sep 8, 2015 at 2:12

1 Answer 1

2

(I wanted to make this a comment but it became too long)

An answer would depend on how you want to access and process the words, once stored.

There are significant benefits and distinct advantages for 3 candidates:

  1. Arrays are very efficient to populate and retrieve all items at once (ex. range to array and array back to range), but much slower at re-sizing and inserting items in the middle. Each Redim copies the entire memory block to a larger location, and if Preserve is used, all values copied over as well. This may translate to perceived slowness for every operation (in a potential application)

    • More details (arrays vs collections) here (VB specific but it applies to VBA as well)
  2. Collections are linked lists with hash-tables - quite slow to populate but after that you get instant access to any element in the collection, and just as fast at reordering (sorting) and re-sizing. This can translate into a slow opening file, but all other operations are instant. Other aspects:

    • Retrieve keys as well as the items associated with those keys
    • Handle case-sensitive keys
    • Items can be other collections, arrays, objects
    • While keys must be unique, they are also optional
    • An item can be returned in reference to its key, or in reference to its index value
    • Keys are always strings, and always case insensitive
    • Items are accessible and retrievable, but its keys are not
    • Cannot remove all items at once (either one by one, or destroy then recreate the Collection
    • Enumerating with For...Each...Next, lists all items

    • More info here and here

  3. Dictionaries: same as collections but with the extra benefit of the .Exists() method which, in some scenarios, makes them much faster than collections. Other aspects:

    • Keys are mandatory and always unique to that Dictionary
    • An item can only be returned in reference to its key
    • The key can take any data type; for string keys, by default a Dictionary is case sensitive
    • Exists() method to test for the existence of a particular key (and item)

      • Collections have no similar test; instead, you must attempt to retrieve a value from the Collection, and handle the resulting error if the key is not found
    • Items AND keys are always accessible and retrievable to the developer
    • Item property is read/write, so it allows changing the item associated with a particular key
    • Allows you to remove all items in a single step without destroying the Dictionary itself
    • Using For...Each...Next dictionaries will enumerate the keys
    • A Dictionary supports implicit adding of an item using the Item property.

      • In Collections, items must be added explicitly
    • More details here


Other links: optimizing loops and optimizing strings (same site)

Sign up to request clarification or add additional context in comments.

12 Comments

Thank you for the response! Well, I actually only redim this array once at the very beginning (in order to set the UBound). Other than that I am only using it to create other lists, and compare those lists with its contents. I may change the actual strings in the arr (to change my secret bool val), but I won't need to remove a str or insert any new ones. I suppose to say which one is better is sort of speculative?
I do know that I'm going to check these bool vals many times for the whole arr. So, not sure if that instr operation will slow me down too much.
I'm trying to analyze a vast amount of legal depositions (which would be my 30m words). From that point on I am finding the most popular 2, 3, 4, and 5 word phrases (a lot of double for loops there). And then I'm running some searches (more looping) to eliminate phrases I don't want based on low frequency, and other factors. And that's about it. It's for court reporting, i.e. trying to improve my typing speed.
I see. This may be a challenge for Excel as each cell has limits on how many characters it can hold, so you may need to do all processing in memory, or create logic to split your phrases on the sheet (1,048,576 rows by 16,384 columns on an unlimited number of sheets). Either way, you'll probably be making extensive use of string functions; make sure you know their subtleties and parameters well, from my previous link
Do you notice any slowness? (or you didn't get to work with all data yet?) Not sure how much effort is required to convert all code, but ideally, you should convert an isolated functionality that can perform a task end-to-end, then measure the difference in performance between the approaches before committing to the full conversion
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.