
I have a list of rules in the form

L1 -> (A, B, C)
L2 -> (D, E)
L3 -> (F, G, A)
L4 -> (C, A)
.....

This list contains ~30k such rules.

I have an input in the form (X, Y, Z)

For this I have a method

List<Rule> matchRules(input)

which belongs to a class RuleMatcher.

I started with a very simple, naive solution in order to get the framework down and get something working.

public RuleMatcher(Collection<Rule> rules) {
    this.rules = rules;
}

public Collection<Rule> matchRules(List<Token> input) {
    // Linear scan: test the input against every one of the ~30k rules.
    List<Rule> matchingRules = new ArrayList<>();
    for (Rule r : this.rules) {
        if (r.matches(input)) {
            matchingRules.add(r);
        }
    }
    return matchingRules;
}

Here matches is a very simple function that checks whether the lengths are the same and then compares the tokens one by one in a for loop.

This matchRules function is called on the order of billions of times.


Obviously this is a very poor implementation. According to my profiler, at least half of the execution time is spent in this matches function.

I was thinking of a few possible solutions:

A. Some sort of trie data structure holding the chains of rules that could be matched (a sketch follows this list).

B. Some sort of hash function. Each symbol is given a unique identifier. Unfortunately, there are about 8 thousand unique symbols, so this might be difficult. (A hash-based sketch appears after the matches code below.)

C. Make a hashmap keyed on the size of the right-hand side, i.e. the number of tokens in the rule. Unfortunately, the majority of the rules are about the same size, so this may not even be worthwhile.

D. Some awesome solution one of you comes up with.
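For idea A, here is a minimal sketch of what the trie could look like, assuming Token exposes its unique number through a hypothetical getId() accessor and Rule exposes its right-hand side through a hypothetical getRightSide() accessor:

import java.util.*;

class RuleTrie {
    // One trie node per prefix of a right-hand side; rules whose RHS
    // ends at this node are stored in the node itself.
    private static class Node {
        final Map<Integer, Node> children = new HashMap<>();
        final List<Rule> rules = new ArrayList<>();
    }

    private final Node root = new Node();

    RuleTrie(Collection<Rule> rules) {
        for (Rule r : rules) {
            Node node = root;
            for (Token t : r.getRightSide()) {
                node = node.children.computeIfAbsent(t.getId(), k -> new Node());
            }
            node.rules.add(r);
        }
    }

    // A single walk down the trie replaces the scan over all ~30k rules:
    // the cost is proportional to the input length, not the rule count.
    List<Rule> matchRules(List<Token> input) {
        Node node = root;
        for (Token t : input) {
            node = node.children.get(t.getId());
            if (node == null) return Collections.emptyList();
        }
        return node.rules;
    }
}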

I hope somebody can shed some light on this problem.


Edit: A token is just an object with a unique number. For example "NN" is a token. Each instance of "NN" is exactly the same.

Matches code:

public boolean rhsMatches(List<Token> tokens) {
    if (tokens.size() != rhsSize()) return false;
    for (int i = 0; i < rhsSize(); i++) {
        if (!rightSide.get(i).equals(tokens.get(i))) {
            return false;
        }
    }
    return true;
}

It's not very pretty, but it's simple.
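Since each Token carries a unique number, idea B can be taken further than hashing individual symbols: the entire right-hand side can serve as the key of a HashMap, as one of the comments below also suggests. Here is a sketch under the assumptions that Token overrides equals() and hashCode() based on its number (List.hashCode() then combines them for free) and that Rule exposes a hypothetical getRightSide() accessor; several rules may share an RHS, so the map value is a list:

import java.util.*;

public class HashRuleMatcher {
    private final Map<List<Token>, List<Rule>> index = new HashMap<>();

    public HashRuleMatcher(Collection<Rule> rules) {
        // Group the rules by their right-hand sides once, up front.
        for (Rule r : rules) {
            index.computeIfAbsent(r.getRightSide(), k -> new ArrayList<>()).add(r);
        }
    }

    public List<Rule> matchRules(List<Token> input) {
        // One hash lookup per call instead of a scan over ~30k rules.
        return index.getOrDefault(input, Collections.emptyList());
    }
}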

  • Could you give us the definition of the tokens? Without knowing what is being matched and how the matching is done, it will be difficult to propose an optimization. Commented Jan 16, 2014 at 16:58
  • So you have 30,000 rules (L1, L2, ...) containing sets of 8,000 unique tokens (A, B, ...), correct? Have you considered creating a "reverse lookup table" (can't remember the actual name) where you index which rules the tokens are in? This may take a lot of memory, but speed should increase greatly. Commented Jan 16, 2014 at 17:00
  • You can use some other hash (like a checksum) for keys, not only the length. And, yes, the matches code would be helpful. Commented Jan 16, 2014 at 17:01
  • I'd say your idea of a trie would be your best first hit. Essentially, you need to build a grammar. Commented Jan 16, 2014 at 17:06
  • I think that HashMap<List<Token>, Rule> should work, if you override the hashCode and equals methods on the Token class. A trie would be far more efficient when it comes to memory usage. Commented Jan 16, 2014 at 18:01

2 Answers


Why not sort your rule list to begin with? Then you can binary-search for the matching rule.
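A minimal sketch of this idea, assuming Rule exposes its right-hand side through a hypothetical getRightSide() accessor and Token exposes its unique number through a hypothetical getId(): sort the rules once by their right-hand sides, and each lookup then costs O(log n) list comparisons instead of a scan over all 30k rules. Because several rules may share an RHS, the search expands to the adjacent run of equal entries:

import java.util.*;

class SortedRuleMatcher {
    // Lexicographic order on right-hand sides, comparing token ids.
    private static final Comparator<List<Token>> RHS_ORDER = (a, b) -> {
        for (int i = 0, n = Math.min(a.size(), b.size()); i < n; i++) {
            int c = Integer.compare(a.get(i).getId(), b.get(i).getId());
            if (c != 0) return c;
        }
        return Integer.compare(a.size(), b.size());
    };

    private final List<Rule> sorted;

    SortedRuleMatcher(Collection<Rule> rules) {
        sorted = new ArrayList<>(rules);
        sorted.sort(Comparator.comparing(Rule::getRightSide, RHS_ORDER));
    }

    List<Rule> matchRules(List<Token> input) {
        int lo = 0, hi = sorted.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = RHS_ORDER.compare(sorted.get(mid).getRightSide(), input);
            if (c < 0) lo = mid + 1;
            else if (c > 0) hi = mid - 1;
            else {
                // Several rules may share this RHS; collect the adjacent run.
                int first = mid;
                while (first > 0 && RHS_ORDER.compare(sorted.get(first - 1).getRightSide(), input) == 0) {
                    first--;
                }
                List<Rule> out = new ArrayList<>();
                for (int i = first; i < sorted.size()
                        && RHS_ORDER.compare(sorted.get(i).getRightSide(), input) == 0; i++) {
                    out.add(sorted.get(i));
                }
                return out;
            }
        }
        return Collections.emptyList();
    }
}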


1 Comment

This would require Rule to implement Comparable. That sounds like the easiest and fastest solution to me. Depending on how often items are added to the list, the structure could instead be a SortedMap or a TreeSet.

To me this looks like a perfect scenario for engaging some worker threads. The matching tasks seem independent of each other: divide the list of rules and delegate the matching to workers, if that is possible in your situation.
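A minimal sketch of this idea using Java 8 parallel streams, assuming matches is free of side effects so the checks can safely run concurrently (rules is the same field as in the question):

import java.util.List;
import java.util.stream.Collectors;

public List<Rule> matchRules(List<Token> input) {
    // Fan the ~30k independent match checks out across the common fork-join pool.
    return rules.parallelStream()
                .filter(r -> r.matches(input))
                .collect(Collectors.toList());
}

Note that this only divides the linear scan across cores; combining it with an indexed structure (trie, hash map, or sorted list) attacks the cost itself.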

