6

Hi, I was wondering whether anyone could offer some advice on the fastest / most efficient way to compare two arrays of strings in JavaScript.

I am developing a kind of tag cloud based on a user's input, the input being a written piece of text such as a blog article or the like.

I therefore keep an array of words not to include: is, a, the, etc.

At the moment I am doing the following:

Remove all punctuation from the input string, tokenize it, compare each word to the exclude array and then remove any duplicates.

The comparisons are performed by looping over each item in the exclude array for every word in the input text. This seems kind of brute force and is crashing Internet Explorer on arrays of more than a few hundred words.

I should also mention my exclude list has around 300 items.

Any help would really be appreciated.

Thanks

5 Answers 5

5

I'm not sure about the whole approach, but rather than building a huge array then iterating over it, why not put the "keys" into a map-"like" object for easier comparison?

e.g.

var excludes = {};//object
//set keys into the "map"
excludes['bad'] = true;
excludes['words'] = true;
excludes['exclude'] = true;
excludes['all'] = true;
excludes['these'] = true;

Then when you want to compare... just do

var wordsToTest = ['these','are','all','my','words','to','check','for'];
var checkWord;
for(var i=0;i<wordsToTest.length;i++){
  checkWord = wordsToTest[i];
  if(excludes[checkWord]){
    //bad word, ignore...
  } else {
    //good word... do something with it
  }
}

This lets the following words through: ['are','my','to','check','for']
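If the exclude list already exists as an array, the lookup object can be built from it in one pass. The sketch below also guards each lookup with hasOwnProperty, so that inherited properties such as "constructor" are not mistaken for excluded words. `buildExcludeMap` and `filterWords` are hypothetical helper names used for illustration only.

```javascript
// Hypothetical helper: turn an array of words into a lookup object.
function buildExcludeMap(words) {
  var map = {};
  for (var i = 0; i < words.length; i++) {
    map[words[i]] = true;
  }
  return map;
}

// Hypothetical helper: keep only the words that are not keys of the lookup object.
function filterWords(words, excludeMap) {
  var kept = [];
  for (var i = 0; i < words.length; i++) {
    // hasOwnProperty ignores inherited keys such as "toString" or "constructor"
    if (!Object.prototype.hasOwnProperty.call(excludeMap, words[i])) {
      kept.push(words[i]);
    }
  }
  return kept;
}

var excludeMap = buildExcludeMap(['bad', 'words', 'exclude', 'all', 'these']);
var kept = filterWords(
  ['these', 'are', 'all', 'my', 'words', 'to', 'check', 'for', 'constructor'],
  excludeMap
);
// kept is ['are', 'my', 'to', 'check', 'for', 'constructor']
```

Each lookup is a single property check, so the cost per word no longer grows with the size of the exclude list.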

8 Comments

To prevent any chance of this going wrong because of augmentation of Object.prototype (for example, if a library has added an each method to Object.prototype, "each" will be considered a bad word in the example code), you could use jshashtable (timdown.co.uk/jshashtable).
This makes sense. I have implemented it and it works great in Firefox, but it still crashes IE as it did before. I wonder if IE is known for problems like this or if my code can be improved.
Edit: I have just tested my code in Chrome, Opera, Firefox and Safari and it works super fast. In IE it fails miserably and I have to restart the browser :(
@Tim Down: That's a very good reason NOT to use frameworks that mash the default namespace until it's unrecognizable.
@David - do you have a URL where we can see the whole thing? There may be something else tripping up IE.
2

It would be worth a try to combine the words into a single regex, and then compare with that. The regex engine's optimizations might allow the search to skip forward through the search text a lot more efficiently than you could do by iterating yourself over separate strings.
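One way to try this idea on words that are already tokenized is a single anchored alternation. This is only a sketch: `escapeRegExp` and `buildExcludeRegex` are hypothetical helper names, and whether this beats a plain object lookup in a given browser would need measuring.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build one anchored, case-insensitive pattern from the exclude list.
function buildExcludeRegex(words) {
  var escaped = [];
  for (var i = 0; i < words.length; i++) {
    escaped.push(escapeRegExp(words[i]));
  }
  return new RegExp('^(?:' + escaped.join('|') + ')$', 'i');
}

var excludeRegex = buildExcludeRegex(['is', 'a', 'the']);
var tokens = ['The', 'theory', 'is', 'a', 'start'];
var kept = [];
for (var i = 0; i < tokens.length; i++) {
  // the ^...$ anchors make sure only whole words match, so "theory" is kept
  if (!excludeRegex.test(tokens[i])) {
    kept.push(tokens[i]);
  }
}
// kept is ['theory', 'start']
```

The pattern is built once and reused for every token, and it deliberately omits the g flag so that test() keeps no state between calls.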

0

You could use a hashing function for strings (I don't know if JS has one, but I'm sure uncle Google can help ;] ). Then you would calculate hashes for all the words in your exclude list and create an array of booleans indexed by those hashes. Then just iterate through the text and check the word hashes against that array.
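A rough sketch of this idea, under some assumptions: the hash is a djb2-style function, the table size of 1024 is arbitrary, and all function names are hypothetical. Because two different words can share a hash, the sketch stores the words themselves in each slot rather than a bare boolean, which would otherwise misclassify colliding words. JavaScript objects already behave much like hash tables, so this mainly illustrates the mechanism.

```javascript
var TABLE_SIZE = 1024; // arbitrary size chosen for this sketch

// djb2-style string hash, reduced to an index in the range 0..TABLE_SIZE-1
function hashWord(word) {
  var hash = 5381;
  for (var i = 0; i < word.length; i++) {
    hash = (hash * 33 + word.charCodeAt(i)) % TABLE_SIZE;
  }
  return hash;
}

// Build the table once: each used slot holds the exclude words that hash to it.
function buildHashTable(words) {
  var table = [];
  for (var i = 0; i < words.length; i++) {
    var slot = hashWord(words[i]);
    if (!table[slot]) {
      table[slot] = [];
    }
    table[slot].push(words[i]);
  }
  return table;
}

// Look a word up: hash it, then compare only against the words in that slot.
function isExcluded(table, word) {
  var bucket = table[hashWord(word)];
  if (!bucket) {
    return false;
  }
  for (var i = 0; i < bucket.length; i++) {
    if (bucket[i] === word) {
      return true;
    }
  }
  return false;
}

var table = buildHashTable(['is', 'a', 'the']);
var theIsExcluded = isExcluded(table, 'the');     // true
var cloudIsExcluded = isExcluded(table, 'cloud'); // false
```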

2 Comments

Thanks for the reply, I will definitely look into that, but how much faster can this be, since you are still essentially iterating over the same number of elements the same number of times, aren't you?
Nope. The algorithm that you presented has O(nmk) complexity, where n is the exclude list size, m is the text size, and k is the average number of operations in a string comparison. The method I'm proposing has O(n) complexity for the initial hashing and O(m) for checking the text, because each word needs only one hash lookup instead of a pass over the exclude list.
0

I have taken scunliffe's answer and modified it as follows:

var excludes = ['bad','words','exclude','all','these']; //array

Now let's add a function to the Array prototype that checks whether a value is inside an array:

Array.prototype.hasValue= function(value) {
  for (var i=0; i<this.length; i++)
      if (this[i] === value) return true; 
  return false;
}

Let's test some words:

var wordsToTest = ['these','are','all','my','words','to','check','for'];
var checkWord;
for(var i=0; i< wordsToTest.length; i++){
  checkWord = wordsToTest[i];
  if( excludes.hasValue(checkWord) ){
    //is bad word
  } else {
    //is good word
    console.log( checkWord );
  }
}

Output (each word is logged by its own console.log call, shown here as a list):

['are','my','to','check','for']

0

I'd opt for the regex version

var text = 'This is a text that contains the words to delete. It has some <b>HTML</b> code in it, and punctuation!';
var deleteWords = ['is', 'a', 'that', 'the', 'to', 'this', 'it', 'in', 'and', 'has'];

// clear punctuation and HTML code
var onlyWordsReg = /<[^>]*>|\W/g;
var onlyWordsText = text.replace(onlyWordsReg, ' ');

// one case-insensitive pattern that matches any of the delete words as a whole word
var reg = new RegExp('\\b' + deleteWords.join('\\b|\\b') + '\\b', 'ig');
var cleanText = onlyWordsText.replace(reg, '');

// tokenize after this
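The remaining tokenize and de-duplicate steps could look roughly like this. It is a sketch only: `uniqueWords` is a hypothetical helper, lower-casing the tokens is an assumption about what a tag cloud wants, and the sample string merely stands in for text that has already had punctuation and excluded words removed.

```javascript
// Hypothetical helper: split cleaned text on whitespace and drop repeated words.
function uniqueWords(cleanedText) {
  var tokens = cleanedText.toLowerCase().split(/\s+/);
  var seen = {};
  var result = [];
  for (var i = 0; i < tokens.length; i++) {
    var token = tokens[i];
    if (token === '') {
      continue; // leading or trailing whitespace produces empty tokens
    }
    // own-property check so words like "constructor" are handled correctly
    if (!Object.prototype.hasOwnProperty.call(seen, token)) {
      seen[token] = true;
      result.push(token);
    }
  }
  return result;
}

var sample = '   text  contains   words   delete    some  HTML  code    punctuation  text ';
var words = uniqueWords(sample);
// words is ['text', 'contains', 'words', 'delete', 'some', 'html', 'code', 'punctuation']
```

The `seen` object gives each duplicate check a single property lookup, so the whole pass stays linear in the number of tokens.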
