3

I am writing a solver for a crossword puzzle that reads in a dictionary file and given a pattern returns a list of all words that fit that pattern. I have the functionality working but I need this to work faster. I create a HashMap with the length of words being the key and an ArrayList of the words as the value. Is there someway I can read through the ArrayList faster or is there some better data structure to use?

import java.util.*;

public class CWSolution {

    //create the data structure that will store the dictionary
    private HashMap<Integer,ArrayList<String>> db;

    public CWSolution(List<String> allWords)
    {

    //construct the background structure

        //Create hashmap
        db = new HashMap<Integer,ArrayList<String>>();

        //go through each word
        for(String item : allWords ){
            //if the db does not contain a listing for this word length, create one
            if(!db.containsKey(item.length())){
                ArrayList<String> temp = new ArrayList<String>();
                temp.add(item);
                db.put(item.length(), temp);
            }
            //if it does contain a listing for this word length, add this word to it
            else{
                ArrayList<String> temp = db.get(item.length());
                temp.add(item);
                db.put(item.length(), temp);
            }
        }
    }

    public List<String> solutions(String pattern, int maxRequired)
    {

        //actually look for each pattern

        //create the structures we need
        List<String> answer = new ArrayList<String>();

        //get the relevant array list
        ArrayList<String> temp = db.get(pattern.length());

        //go through the array list word by word
        for(String item : temp ){
            //see if the pattern matches the word, if it does add it to the list, otherwise skip it
            if(matchPattern(pattern, item)){
                answer.add(item);
            }
            //if we reach the required size return it, otherwise keep going
            if(answer.size() == maxRequired){
                return answer;
            }
        }
        return answer;
    }

    private boolean matchPattern(String pattern, String word){
        //TODO implement this function
        //check the word against the pattern
        char star = "*".charAt(0);
        for(int i=0;i<pattern.length();i++){
            if(pattern.charAt(i) != star){
                if(pattern.charAt(i) != word.charAt(i)){
                    return false;
                }
            }
        }
        return true;
    }
}

EDIT: Adding some more information to make this more clear.

Some of the comments were debating this so figured I'd clarify, I am a student in a Data Structures course so there is only so much that I know about, but we are nearing the end of the semester so I have a good idea of basic data structures.

Furthermore I am not as concerned about optimizing the CWSolution() method a I am with optimizing the solutions() method. The speed is being tested as follows and what I am really concerned with is Time2. That is how long it takes to find matching words rather than how long it takes to create the structure.

import java.util.Date;
import java.util.List;


public class CWSpeedTest {

    public static void main(String[] args){
    try{
        FileParser fp = new FileParser("TWL06.txt");
        List<String> solutions = null;
        //Change this to change the pattern
        String pattern = "*S**"; 
        //Change this to change the max solutions
        int maxSolns = 2000;

        List<String> dict = fp.getAllWords();

        Date d1 = new Date();
        CWSolution c = new CWSolution(dict);
        Date d2 = new Date();
        for (int i = 0; i < 1000; i++)
        solutions = c.solutions(pattern,maxSolns);
        Date d3 = new Date();
        System.out.println("Time 1: " + (d2.getTime() - d1.getTime()));
        System.out.println("Time 2: " + (d3.getTime() - d2.getTime()));
        System.out.println("For the pattern: " + pattern);
        System.out.println("With max solutions: " + maxSolns);
        System.out.println(solutions);

    }catch (Exception e){
        e.printStackTrace();
    }
    }
}
14
  • 1
    Off the top of my head: the data structure you probably want, after filtering by length, is something like a Map<Integer, Map<Character, Set<String>>> - the Set is all the words that have the given Character at the position given by the Integer. You then go over characters in the pattern, get the Sets from this map for all the specified characters, and successively do intersections of these sets. Commented Dec 10, 2013 at 21:46
  • 1
    I'd probably actually use a thin wrapper after this Map<Integer, Map<Integer, Map<Character, Set<String>>>>because that amount of angle brackets is painful. What you're really looking for is a multidimensional matrix. Alternatively, you could create a separate class DictionaryKey for the map key with the fields length, position, and character. Have your IDE generate a hashCode() using those fields, and then use a Map<DictionaryKey, Set<String>> to make the code more readable. This way you could microoptimise a bit by making DictionaryKey immutable and memoising the hash code. Commented Dec 10, 2013 at 22:00
  • 1
    @BorisBrodski That's what I mean. Say the map for length -> position -> character -> words is called db. You would store the word "puppy" under the length 5, and the position, character keys 0, 'p', 1, 'u', 2, 'p' etc. (Each word would be indexed by each character in the word, not just the starting word. That's precisely why the key is position, character.) Then if you're looking for words that match **ppy, you get the sets db[2]['p'], db[3]['p'], and db[4]['y']. I'd code it up but I'm busy ATM, and if this is homework doing so would be a great exercise for the OP ;) Commented Dec 10, 2013 at 22:36
  • 1
    As for why my approach is a significant improvement. The OP's code, if I'm reading it correctly, iterates over, say total_words/10 words. (Assuming that most of the words have a length between 3 and 13. The exact length distribution doesn't actually matter since my algorithm filters by word length too.) Then it performs word_length character comparisons for each of those words. This is possibly a large number if the dictionary is big. Commented Dec 10, 2013 at 22:44
  • 1
    To summarise, if N is the number of words in the dictionary, and m is the length of the current word, the OP's approach is O((N * m) / 10). Mine is O(N / 260) and some change. Seeing how m is likely to be modest it's the same complexity class, but still at least an order of magnitude faster. Commented Dec 10, 2013 at 22:58

2 Answers 2

3

Here is complete rewrite of the algorithm using indexing on all positions and characters. First this algorithm finds the shortest list of the words with a specified character at the specified position found in the pattern. Then it calculates cross section with all other lists of the words (one list per non-star character in the pattern).

import java.util.*;

public class CWSolution {

    class FixLengthDB {
        // Index -> Letter -> All word with the Letter at Index
        HashMap<Integer, HashMap<Character, Set<String>>> indexLetterDb = new HashMap<>();

        public void storeWord(String word) {
            int l = word.length();
            for (int i = 0; i < l; i++) {
                HashMap<Character, Set<String>> letterDb = indexLetterDb.get(i);
                if (letterDb == null) {
                    letterDb = new HashMap<>();
                    indexLetterDb.put(i, letterDb);
                }

                Set<String> list = letterDb.get(word.charAt(i));
                if (list == null) {
                    list = new HashSet<>();
                    letterDb.put(word.charAt(i), list);
                }

                list.add(word);
            }
        }

        public Set<String> getList(int i, char c) {
            HashMap<Character, Set<String>> letterDb = indexLetterDb.get(i);
            if (letterDb == null) {
                return null;
            }
            return letterDb.get(c);
        }
    }

    //create the data structure that will store the dictionary
    private HashMap<Integer,FixLengthDB> db = new HashMap<>();
    private List<String> allWords;

    public CWSolution(List<String> allWords)
    {

        //construct the background structure

        this.allWords = allWords;
        //go through each word
        for(String item : allWords) {
            FixLengthDB fixLengthDB = db.get(item.length());

            if (fixLengthDB == null) {
                fixLengthDB = new FixLengthDB();
                db.put(item.length(), fixLengthDB);
            }

            fixLengthDB.storeWord(item);
        }
    }

    public List<String> solutions(String pattern, int maxRequired)
    {

        FixLengthDB fixLengthDB = db.get(pattern.length());
        if (fixLengthDB == null) {
            return new ArrayList<>();
        }

        Set<String> shortList = null;
        int shortListIndex = 0;
        int l = pattern.length();
        for (int i = 0; i < l; i++) {
            if (pattern.charAt(i) == '*') {
                continue;
            }
            Set<String> set = fixLengthDB.getList(i, pattern.charAt(i));
            if (set == null) {
                return new ArrayList<>();
            }

            if (shortList == null || shortList.size() > set.size()) {
                shortList = set;
                shortListIndex = i;
            }
        }

        if (shortList == null) {
            return allWords;
        }

        HashSet<String> result = new HashSet<>(shortList);
        for (int i = 0; i < l; i++) {
            if (i == shortListIndex || pattern.charAt(i) == '*') {
                continue;
            }
            Set<String> set = fixLengthDB.getList(i, pattern.charAt(i));
            result.retainAll(set);
        }

            // TODO truncate result list according to 'maxRequired' parameter
    return new ArrayList<>(result);
    }
}

Explanation: The algorithm works in two steps

  • Build an index (in the constructor)
  • Use index to find matched words (solutions(...))

Build index: The index maintain the sets of string for each word-length/character-index/character.

Here how we add words to the index

Add word: fun
          |||
          ||\--- (length: 3, position 3, character 'n') -> set{"fun"})
          |\---- (length: 3, position 2, character 'u') -> set{"fun"})
          \----- (length: 3, position 1, character 'f') -> set{"fun"})

Add word: run
          |||
          ||\--- (length: 3, position 3, character 'n') -> set{"fun, run"})
          |\---- (length: 3, position 2, character 'u') -> set{"fun, run"})
          \----- (length: 3, position 1, character 'r') -> set{"run"})

Add word: raw
          |||
          ||\--- (length: 3, position 3, character 'w') -> set{"raw"})
          |\---- (length: 3, position 2, character 'a') -> set{"raw"})
          \----- (length: 3, position 1, character 'r') -> set{"run, raw"})

Add word: rar
          |||
          ||\--- (length: 3, position 3, character 'r') -> set{"rar"})
          |\---- (length: 3, position 2, character 'a') -> set{"raw, rar"})
          \----- (length: 3, position 1, character 'r') -> set{"run, raw, rar"})

The database after adding four words (fun, run, raw, rar) is

(length: 3, position 1, character 'f') -> set{"fun"})
(length: 3, position 1, character 'r') -> set{"run, raw, rar"})
(length: 3, position 2, character 'u') -> set{"fun, run"})
(length: 3, position 2, character 'a') -> set{"raw, rar"})
(length: 3, position 3, character 'w') -> set{"raw"})
(length: 3, position 3, character 'r') -> set{"rar"})
(length: 3, position 3, character 'n') -> set{"fun, run"})

Use index: How lets match the pattern ru*

First lets find the matching smallest set in the index. We have only 2 non-star characters, so we checking only two sets

1: (length: 3, position 1, character 'r') -> set{"run, raw, rar"})
2: (length: 3, position 2, character 'u') -> set{"fun, run"})

The smallest set is #2 {"fun, run"}. Now we iterate through all other matching sets (in our case the set #1) and calculate intersections:

{"fun, run"} cross {"run, raw, rar"} => {"run"}

The result is {"run"}.

Sign up to request clarification or add additional context in comments.

2 Comments

Could you please explain the algorithm you implemented for the solutions() method. I am having some trouble understanding everything it is doing.
Added algorithm explanation as requested
2

If you want to optimize for look-up speed, your ideal solution is where you take all that you know (everything about the pattern), and it gives you only the set of words that matches. Realistically, you are going to use some of what you know to narrow down to a set that is of acceptable size.

In the original code, you are using only one item that you know (i.e. the length) as the key. Millimoose's comments provide the correct answer: create a key that is more discriminating. For instance, suppose you had a two-field key: (Length, Character-Contained)... i.e. 1A, 1B, 1C,... 1Z, 2A... If each pointed to a set, each set would be smaller. You could then use length and any letter from your pattern to get to that set.

Taking it one step further, you could have Millimoose's a three-field key (Length, Position, Character). That way, you take any letter from your pattern, and using those three attributes you can narrow it down to an even smaller list. [As Millimoose points out, what is slowing you down is the String comparisons.]

In theory, you could go all the way and have a set for each possible pattern. For instance, the word "man" would fit patterns "m**","ma*","m*n","*a*","ma*","*an","**n","m*n" and "*an". Each of those could be the key in a map which points to a list (value) that contains the word "man". For instance, "m**" would point to a list that contained "man", "mob", "map", "mid" and so on.

As you do this, you might end up spending too much time initially, when you are constructing the data-structure. Also, you might end up not having enough space to save that data-structure. Those are the trade-offs.

In summary, from your current-key, consider adding more information to the key, while weighing the costs of doing so.

2 Comments

A set for each possible patterns would be a pretty bad waste of memory. The approach is equivalent to mine - there's no difference between a map from "m**" to a list of words, and a map from (0, 'm') to the same list. The former would use 2 ^ total_characters_in_dictionary index entries. The latter only uses total_characters_in_dictionary ones.
Sure, your approach was perfect.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.