Jan 19

<script src=”gaddag.js” />

Tag: Javascript,Programming,WebAbhijeet Maharana @ 2:45 pm

Words with Friends on Facebook is a recent addiction and I was wondering what it would take to automate the suggestion of valid words given a player’s rack and a letter on the board that the player wants to hook into. A quick search led me to the DAWG and GADDAG data structures which are most commonly used for problems like this. GADDAG is better suited for the Scrabble (or Words with Friends)  situation and is faster than a DAWG based approach even if it requires about 5 times more space on an average.

I spent some time reading Steven Gordon’s paper from 1994 in which he presented the GADDAG data structure. The link is at the end of this post. For understanding how the GADDAG data structure works, let’s look at the Trie (which can be considered a special kind of DAWG) based representation of the words CAR, CARE and CARREL:

In the image, EOW stands for End-of-word. Tracing the paths in the tree, one can arrive at the words represented by the tree. This saves space as it groups words with common prefixes which results in fewer nodes. If someone wants all words starting with ‘C’, they just have to traverse the subtree starting with ‘C’. This tree can also answer queries to find all words with a specific prefix. However, answering queries such as “all words containing E” is not as straightforward. While one can traverse the entire tree looking for subtrees with an ‘E’, it is not efficient and for large trees such as those that store entire dictionaries for word games like Scrabble, the search time would be very high. This is where the GADDAG comes in.

For representing a word, the GADDAG data structure has as many subtrees as letters in that word. In other words, as many subtrees as the number of prefixes in that word. For example, to represent the word CARE, the first subtree would be for C>ARE. Here, C is the prefix and > separates the prefix from the rest of the word. However, the subsequent subtrees would be for AC>RE, RAC>E and ERAC>, where the prefixes “CA”, “CAR” and “CARE” are reversed before being stored in the tree. This is better explained with a diagram:

To reconstruct the word, one reads the letters before > in reverse and the letters appearing after > in correct order. This is the key to GADDAG. For example, to find all words containing an ‘R’, one descends the subtree rooted at R, reads RAC  and reverses it to get CAR and then reads E, appending it to form CARE. Extending this to our original example of CAR, CARE and CARREL, we get the tree shown below.

This is the key to GADDAG. To find all words which contain ‘E’, one just has to descend the subtree rooted at ‘E’.

I wrote a Javascript based implementation of GADDAG which is based on John Resig’s work on Trie. Given our example words CAR, CARE and CARREL, John’s JSON Trie structure looks like:

{
    "C": {
        "A": {
            "R": {
                "$": 0,
                "E": 0,
                "R": {
                    "E": {
                        "L": 0
                    }
                }
            }
        }
    }
}

In this structure, 0 or $:0 is used to indicate EOW. The output JSON gets complicated really fast as the number of words increase. I used Chris Nielsen’s (amazing) JSON visualizer while implementing this. It is an indispensable tool if you do any JSON related work.

The JSON GADDAG output, which is too big to be include in its pretty form, is given below. Graphically, it looks like this.

{{"C":{">":{"A":{"R":{"$":0,"E":0,"R":{"E":{"L":0}}}}}},"A":{"C":{">":{"R":{"$":0,"E":0,"R":{"E":{"L":0}}}}}},"R":{"A":{"C":{">":{"$":0,"E":0,"R":{"E":{"L":0}}}}},"R":{"A":{"C":{">":{"E":{"L":0}}}}}},"E":{"R":{"A":{"C":{">":0}},"R":{"A":{"C":{">":{"L":0}}}}}},"L":{"E":{"R":{"R":{"A":{"C":{">":0}}}}}}}

The Trie “class” looks like this. I modified John’s code to pack the functions I need into one class. John’s blog post contains detailed description of the Trie implementation.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
// The Trie "class"
function Trie() {
 
    // Structure that actually holds the tree
    var trie = {};
 
    // Adds all words in the array passed as parameter
    this.addAll = function (words) {...}
 
    // Adds one word to the Trie
    this.add = function (word) {...}
 
    // Returns the internal tree structure
    this.getTrie = function () {...}
 
    // Returns all words contained in the Trie
    this.getWords = function () {...}
 
    // Returns the JSON string representation of the tree (like the one I posted above)
    this.getJson = function () {...}
}

Limitation (and a possible improvement) of addAll() is that input array must contain words in ascending order. My Gaddag class inherits from the Trie class.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
function Gaddag() {
 
    var separator = ">";
 
    // Adds a word to the GADDAG
    this.add = function (word) {...}
 
    // Finds all words containing a hook letter
    this.findWordsWithHook = function (hook) {...}
 
    // Find all words containing a hook letter which can be formed using the letters in the array passed as rack
    this.findWordsWithRackAndHook = function (rack, hook) {...}
 
    // Splits the input array at index and returns the character at that index and the array minus that character as an object
    this.spliceArray = function (arr, index) {
}
 
// Inherit from Trie
Gaddag.prototype = new Trie();

Adding a word to a GADDAG is only slightly different from adding it to a Trie. However, words containing duplicate letters are added as many times to the tree. This may be required for some queries to work (ones that involve length of prefix as a parameter) but for other cases, just one entry might be sufficient. This is one possible improvement depending on the usage.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
    this.add = function (word) {
 
        if (word.length === 0) return;
 
        // add every (reversed-prefix + ">" + suffix) combination 
        for (var i = 1; i < word.length; i++) {
            var prefix, ch;        
            prefix = word.substring(0, i);
            ch = prefix.split('');
            ch.reverse();
 
            // call base class's add() to add this string to the tree
            Gaddag.prototype.add(ch.join('') + separator + word.substring(i));
        }
 
        // add the whole word as prefix
        ch = word.split('');
        ch.reverse();
        Gaddag.prototype.add(ch.join('') + separator + word.substring(i));
    };

The findWordsWithHook(hook) function is shown below. Explanation is inline.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
    this.findWordsWithHook = function (hook) {
        var trie = Gaddag.prototype.getTrie();
 
        // descend down the subtree containing the hook node
        var starterNode = trie[hook];
 
        // output array
        var words = [];
 
        if (typeof starterNode === 'undefined') return;
 
        // recurse through this subtree to collect all words
        dig(hook, starterNode, 'left');
        return words;
 
        /*
                word - word being currently formed
                cur - current node being traversed
                direction - direction of movement ('left' before > and 'right' after >)
        */ 
        function dig(word, cur, direction) {
 
            // for every subtree from current node
            for (var node in cur) {
 
                // get all child nodes
                var val = cur[ node ],
                    ch = (node === separator || node === "$" ? '' : node);
 
                // if we have reached the end of subtree, add complete word to output array
                if (val === 0) {
                    words.push(word + ch);
 
                } else {
 
                    // nodes after > form the suffix so change direction
                    if (node === separator) direction = 'right';
 
                    // add current letter to word
                    var part = (direction === 'left' ? ch + word : word + ch);
 
                    // traverse the remaining nodes to form rest of word
                    dig(part, val, direction);
 
                }
 
                // done with the previous subtree, reset direction to indicate we are in the prefix part of next subtree
                if (node === separator) direction = 'left';
            }
        }
    }

The findWordsWithRackAndHook(rack,hook) function is shown below. It is similar to the above function except that we have to loop through the rack and check if any of the child nodes appear in the rack before descending.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
this.findWordsWithRackAndHook = function (rack, hook) {
    var trie = Gaddag.prototype.getTrie();
    var words = [];
 
    /* To avoid recursing down duplicate characters more than once, sort the array and check whether we have already
     processed a letter before descending the subtree.
     */
    rack.sort();
 
    if (hook === '') {
        // each character in the rack acts as a hook with the following characters as the new rack
        while(rack.length > 1) {
            var h = rack.shift();
            findWordsRecurse("", rack, h, trie, 'reverse');
        }
    }
    else {
        findWordsRecurse("", rack, hook, trie, 'reverse');
    }
 
    return words.unique();
 
    function findWordsRecurse(word, rack, hook, cur, direction) {
        var hookNode = cur[ hook ];
 
        if (typeof hookNode === 'undefined') return;
 
        var hookCh = (hook === separator || hook === "$" ? '' : hook);
        word = (direction === "reverse" ? hookCh + word : word + hookCh);
 
        for (var nodeKey in hookNode) {
            var nodeVal = hookNode[ nodeKey ];
            var nodeCh = (nodeKey === separator || nodeKey === "$" ? '' : nodeKey);
 
            // if we have reached the end of this subtree, add the word (+ last character) to output array
            if (nodeVal === 0) {
                words.push(word + nodeCh);
            } else {
                // if this is the character separating the prefix, change direction and continue recursing
                if (nodeKey === separator) {
                    findWordsRecurse(word, rack, separator, hookNode, 'forward');
                }
                else {
                    processRack(word, rack, nodeKey, hookNode, direction);
                }
            }
        }
    }
 
    // descend down the next subtree that is rooted at any letter in the rack (which is not a duplicate)
    function processRack(word, rack, nodeKey, hookNode, direction) {
        for (var i = 0; i < rack.length; i++) {
            if (nodeKey === rack[i]) {
                var duplicate = (i > 0 ? (rack[i] === rack[i - 1] ? true : false) : false);
                if (!duplicate) {
                    var newRack = rack.slice(0);
                    newRack.remove(i);
                    findWordsRecurse(word, newRack, nodeKey, hookNode, direction);
                }
            }
        }
    }
}

These functions are not as straightforward as the underlying concept of GADDAG appears to be and it took me quite some time to get them working. I will be tweaking the code over the next few days and test it with full dictionaries. The code can be downloaded here. You can either run test-gaddag.js from the prompt using node.js or load test-gaddag.html in the browser.

Update (1/23/2012):
1. Bug fixes
2. Added check for start of game (no hook)
3. Added check to avoid processing duplicate letters in rack more than once

Update (3/27/2013):
James had reported a bug (see comment #10 below) long back. The tree seems to be constructed correctly and the problem was with the traversal code. Specifically lines 37,38 and 39 of findWordsWithRackAndHook.

Replace

37
38
39
            if (nodeVal === 0) {
                words.push(word + nodeCh);
            } else {

with

37
38
39
40
41
42
43
44
            if (nodeVal === 0) {
                if(nodeCh != '' && rack.indexOf(nodeCh) === -1) {
			continue;
		}
		else {
			words.push(word + nodeCh);
		}
            } else {

Before adding the last character to the word, we need to check if it is in the rack. If it is not present in the rack, we skip that character and move to the next one at the same level. I have not updated the downloaded files. Make sure you edit gaddag.js and include this fix.

Thanks for reporting the bug James!

13 Responses to “<script src=”gaddag.js” />”

  1. aphdstudent says:

    Randomly stumbled on this post from Google. I’m also working on such a side project. Might I make a few suggestions:

    A trie only combines nodes of words that share prefixes, which is why a DAWG or GADDAG is needed in the first place if you want to store efficient dictionary information. The DAWG is a trie that also combines nodes of words that share suffixes as well as prefixes, greatly reducing the space necessary to store the dictionary. While the GADDAG boasts a 2x speedup and claims only a 5x increase in storage space, that’s only if you’re storing the minimized version. Since you’re starting with a trie structure and just adding every possible reverse + separator combination of every possible word in the dictionary, the size of your GADDAG representation explodes exponentially (assuming you’re using a scrabble dictionary or something of the sort).

    The Gordon paper describes a nifty algorithm for constructing a semi-minimized GADDAG and I’ve found it’s really easy to follow. (page 225 of the PDF at http://www.ericsink.com/downloads/faster-scrabble-gordon.pdf). This will GREATLY reduce the space requirement necessary, meaning you might be able to store such a representation efficiently in a flat file and wouldn’t have to reconstruct it every time you wanted to generate moves.

    Just my 2 cents.

  2. Abhijeet Maharana says:

    @aphdstudent: You are spot on. Space requirement will be reduced greatly with the minimization outlined in Gordon’s paper. John Resig’s Trie implementation, which this code is based on, also has common suffix handling functionality. When I sit down next to work on this further, Ill look at integrating it. Thanks for pointing it out.

  3. aphdstudent says:

    Oh my bad, I thought the included Trie.js was Resig’s implementation.

    Also the GADDAG minimization that’s outlined only creates the semi-minimized version. Gordon briefly says what might be done to minimize completely, but I haven’t gotten that figured out yet.

    Considering the DAWG paper was written 23 years ago and the GADDAG paper 18 years ago, do you think advances in execution time negate the 2x benefit when considering the 5x increase in space complexity? I feel like keeping the dictionary small is the bigger benefit, given that it might be transmitted to and/or from the server, especially given Resig’s analysis of how the mobile browser frequently crashed under the load of a DAWG. Any thoughts? Do you think a PHP implementation might be more feasible and just use AJAX to send the request off and get the word list back?

  4. Abhijeet Maharana says:

    I modified the Trie implementation and removed the optimizations that I didn’t want to get into at the time (my focus was figuring GADDAG out as I worked through the code). Faster move generation aides strategy (page 220 in Gordon’s paper) while staying within reasonable time limits. For standard games, this may not be an issue anymore given the increase in computing power but for time-constrained variations of the game, it may make a difference in the number of searches made by the algorithm for generating the next move which may not result in the best score immediately.

    Execution time apart, I like the elegance of the algorithm. For finding words which fit into an ‘E’ on the board, just descend down the ‘E’ subtree … I like simple explanations! Sending the whole dictionary to the client could be advantageous in some situations (like highlighting a wrong tile on the board as it is being placed) but I prefer the Ajax variation because even a basic client can run the app successfully (like the mobile browser). One such implementation is http://www.wineverygame.com/. Words with Friends also takes a similar approach.

  5. aphdstudent says:

    Gotcha. So I attempted to implement the GADDAG a while back in PHP, and even the minimized version took A TON of memory (read: >60MB), due to how PHP handles all arrays as associative key/value pairs anyway (or maybe I just had memory leaks, I’m not used to tracking down web-based leaks). How would you implement it? Store as a flat binary file and read in what you needed?

  6. Abhijeet Maharana says:

    I used the test file from https://github.com/jeresig/trie-js/blob/master/dict/binary.txt and ended up with an output gaddag of 12.7MB. I attempted a Google chrome extension and the load time is undesirable. If this was a one-time event, one could live with it for now but the fact that Chrome reloads the extension page every time I click the button makes it practically useless at this point. I haven’t gotten around to exploring what Chrome offers to deal with this. And I am guessing reading in required parts from a tree stored on disk will require a non-trivial subroutine.

  7. Abhijeet Maharana says:

    I linked to the wrong file above. The correct test file is https://github.com/jeresig/trie-js/blob/master/dict/string.txt

  8. aphdstudent says:

    Following Resig’s citations and post comments, I found this version of a DAWG that uses a string representation for the whole trie (eliminating object/array overhead).

    http://ejohn.org/blog/javascript-trie-performance-analysis/#comment-392167

    Supposedly, the total file size of the official scrabble players dictionary (version 3) is < 200K, and the accompanying JS code can do searches and queries on the string itself. I feel like this reduced file size is perfect for web-based interaction (ESPECIALLY if 180K is feasible for storage on a phone browser), even if you get the non-deterministic left-hand searching (which you would have avoided with the GADDAG).

    I'm definitely going to dive into this compact representation and see what I can do (maybe use the representation to encode a GADDAG)?

  9. Abhijeet Maharana says:

    That looks great. And pretty neat documentation too.

  10. James says:

    I’m not sure if this is still under development, but I found a bug with the code. I started with the sample code in test-gaddag.js and tried the following line with the same library: gaddag.findWordsWithRackAndHook(['A','C','R'], ”).join(‘, ‘);

    This should only output A and CAR, but it also outputs AT, CARE, and CAT. Most of the letter combinations I’ve used for the rack have yielded the correct results, but for several combinations like this one, incorrect words are being returned. I’m still trying to figure out if the GADDAG structure itself is being build incorrectly or if the code is traversing the structure incorrectly. I just thought I’d bring this issue up in the hopes that it can be fixed.

  11. Abhijeet Maharana says:

    James, thanks for sharing your finding. I will likely not be able to get to this as I have moved on to other projects. Will appreciate an update if you are going to be spending more time on this.

  12. Noopur says:

    Hi Abhijeet, I am trying to implement GADDAG in java as I am creating a Desktop application for scrabble for my Masters Project. I am really confused about how I should go about storing the tree in Java.

  13. Abhijeet Maharana says:

    @Noopur: The tree is stored as a key-value pair. For a first-stab conversion of the JS code to Java code, Java’s HashMap structure could be used.

    @James: I spent some time today dwelling on your bug and found that the traversal is being done incorrectly. I have updated the post with the fix. Thanks for reporting it!

    Cheers,
    Abhijeet

Leave a Reply