Word List

Status: MATCH and ANAGRAM completed

The word list provides an efficient way to store and search for words for the editor.

Requirements

We need to support the following use cases:

We support two filter modes:
- MATCH: Filter by letters. For example, "?OR??" will match WORDS and CORES, among others.
- ANAGRAM: Find anagrams out of a set of letters. So CAT will return ACT, CAT, TAC, and TCA.
Efficient lookup of words when filling in a crossword. We want to avoid linear searches through the list of words.
Multiple concurrent instances of the word list.
Internationalized word lists.

We also need this to be as fast as possible. In the Crossword Editor, we use matches to efficiently build puzzles recursively in order to find possible solutions for sections of the puzzle.

We never need a list of words of mixed word length: lists are always separated by word length. E.g., we never have a list with both CAT and MOUSE in it. Also, note that when we talk about length, we mean the number of code points in a given word, and not the number of bytes. So AÑO has a length of three, despite being four bytes long when encoded in UTF-8.
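To make the code point versus byte distinction concrete, here is a minimal sketch of counting code points in plain C. The function name utf8_code_points is ours for illustration, and it assumes valid, NUL-terminated UTF-8; code that already uses GLib would normally call g_utf8_strlen() instead.

```c
#include <stddef.h>
#include <string.h>

/* Count Unicode code points in a NUL-terminated, valid UTF-8 string.
 * Continuation bytes have the bit pattern 10xxxxxx and are not counted,
 * so every code point is counted exactly once, at its lead byte. */
static size_t
utf8_code_points (const char *str)
{
  size_t count = 0;

  for (const unsigned char *p = (const unsigned char *) str; *p != '\0'; p++)
    {
      if ((*p & 0xC0) != 0x80)
        count++;
    }
  return count;
}
```

For AÑO, utf8_code_points() gives the word length (three) while strlen() gives the byte count (four).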

Overall approach

We store the word data in an mmap()-able file. This file is generated at compile time, and loaded as a GResource. The file consists of four distinct blocks of data, each described in greater detail below:

Word List – A list of all the words in the data set. These words are bucketed by word length in different subsections. Within each bucket they are stored first by priority, and then alphabetically.
Filter Fragment List – A list of words, indexed by letter, sorted by word length. This is used for MATCH filters. For example: The three letter word subsection would have all the words that match "A??" followed by all the words matching "B??", all the way up through all the words matching "??Z".
Hash List – A table of hashes of each word’s sorted letters, sorted by hash value. This is used by the anagram mode.
Index Section – The table of contents for the file. It will have byte offsets for each section and subsection of the file. It’s located at the end of the file, and is stored as a JSON block.

Everything in the list will be indexed by a byte offset into the data file. Some of the byte-offsets are absolute; some are relative. We mmap the file and seek through it to find what we need.

MATCH mode approach

When in MATCH mode, we decompose the given filter into different filter fragments. A filter fragment is a filter pattern with exactly one character specified, and its list of matching words is precalculated in the Filter Fragment List.

So, for example, consider the filter ??MO??T. For this filter, we break it up into three filter fragments: ??M????, ???O???, and ??????T. We build a list out of the intersection of each filter fragment’s words. This intersection is quick to compute: each fragment’s list is sorted, so we only have to take one pass through each list.
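The single-pass intersection can be sketched as below. This is an illustration rather than the actual implementation: the function name intersect_sorted is hypothetical, and it assumes each fragment's list is an ascending array of 16-bit word indexes with no duplicates.

```c
#include <stddef.h>
#include <stdint.h>

/* Intersect two ascending arrays of word indexes in a single pass.
 * Writes the common values to out, which needs room for
 * min(a_len, b_len) entries, and returns how many were written. */
static size_t
intersect_sorted (const uint16_t *a, size_t a_len,
                  const uint16_t *b, size_t b_len,
                  uint16_t       *out)
{
  size_t i = 0, j = 0, n = 0;

  while (i < a_len && j < b_len)
    {
      if (a[i] < b[j])
        i++;            /* a[i] can't be in b: b is already past it */
      else if (a[i] > b[j])
        j++;            /* b[j] can't be in a: a is already past it */
      else
        {
          out[n++] = a[i];
          i++;
          j++;
        }
    }
  return n;
}
```

For a filter with three fragments such as ??MO??T, the result of intersecting the first two lists would be intersected with the third in the same way.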

As special cases, we can bypass the intersection when our filter has zero or one known characters. When the filter has only ?s, we can just use the Word List bucket for that word length. When the filter has a single character selected (such as ??M????), it is itself a filter fragment, so the Letter Index can be used directly.

ANAGRAM mode approach

In order to find all the anagrams of the filter, we first sort each word’s letters and hash the result. For each hash, we store a list of the words that share that hash. That gives us a quick way of finding all anagrams for a word. There are hash collisions in our word list, but not very many. That means that when we look up the words for a hash, we need to double check that the results really are anagrams.

NOTE: One tradeoff of this approach versus a traditional trie-based approach is that we can’t easily handle anagrams with unknown letters. That means that we can’t easily generate a list of words whose anagrams match CDO?.
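The sort-then-hash idea, together with the collision check, can be sketched as follows. Everything here is an assumption made for illustration: the function names are hypothetical, the hash is djb2 standing in for whichever hash the generated file really uses, and the sort works on bytes, so it is only correct for ASCII words (a real implementation would sort code points).

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORD_BYTES 64

static int
compare_bytes (const void *a, const void *b)
{
  return (int) *(const unsigned char *) a - (int) *(const unsigned char *) b;
}

/* Copy word into key with its letters sorted. ASCII only: sorting bytes
 * would scramble multi-byte UTF-8 sequences. Returns false if too long. */
static bool
anagram_key (const char *word, char key[MAX_WORD_BYTES])
{
  size_t len = strlen (word);

  if (len >= MAX_WORD_BYTES)
    return false;
  memcpy (key, word, len + 1);
  qsort (key, len, sizeof (char), compare_bytes);
  return true;
}

/* ASSUMPTION: djb2 is a placeholder for the real hash function. */
static uint32_t
anagram_hash (const char *word)
{
  char key[MAX_WORD_BYTES];
  uint32_t hash = 5381;

  if (!anagram_key (word, key))
    return 0;
  for (const char *p = key; *p != '\0'; p++)
    hash = hash * 33 + (unsigned char) *p;
  return hash;
}

/* Confirm a candidate really is an anagram; guards against collisions. */
static bool
is_anagram (const char *a, const char *b)
{
  char key_a[MAX_WORD_BYTES], key_b[MAX_WORD_BYTES];

  return anagram_key (a, key_a) && anagram_key (b, key_b)
         && strcmp (key_a, key_b) == 0;
}
```

CAT and ACT both sort to ACT, so they share a hash; is_anagram() is the double check applied to every word found under that hash.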

API

WordList is a GObject that implements a GListModel-like API. It has a stateful “filter” property that determines what words it makes available. The filter defaults to "" – the empty word list.

The meaning of the filter depends on the mode of the word list. Question marks are not parsed as unknown letters in ANAGRAM mode.

typedef enum
{
  WORD_LIST_NONE,
  WORD_LIST_MATCH,
  WORD_LIST_ANAGRAM,
} WordListMode;

WordList    *word_list_new                  (void);
void         word_list_set_filter           (WordList     *word_list,
                                             const gchar  *filter,
                                             WordListMode  mode);
guint        word_list_get_n_items          (WordList     *word_list);
const gchar *word_list_get_word             (WordList     *word_list,
                                             guint         position);
gint         word_list_get_priority         (WordList     *word_list,
                                             guint         position);

The priority of each word is a value between 0 and 255, and defaults to 50.

Like the GListModel interface, this lets the user get the number of words (per filter) and get the word/priority for each position. However, unlike GListModel, it doesn’t emit signals or return GObjects, for performance reasons. Adding those is a relatively simple task, and WordListModel is a wrapper around WordList that provides the GListModel interface.

Note that WordList is completely stable. It will always return the same answers for a given filter. This means that you can set the filter to be some value and iterate through the items, change the filter, and then set it back and continue your iteration.

Data Sections

As mentioned, the overall resource file is divided into four sections of data:

Word List
Filter Fragments
- Letter List
- Filter Index
Anagram Hash Table
- Anagram Words
- Anagram Hash Index
Index Section

Each is described in detail below.

Word List Sections

This block stores all the words along with their priority. The block is divided into multiple Word List Sections – one for each word length. So, for example, there’s a section for all the words that are three characters long, followed by a section for all the words that are four characters long, etc.

Each word entry in a section consists of a 1-byte priority stored as an unsigned char, followed by a UTF-8 string terminated by the null character. It will be padded out by ‘\0’ characters to fit the longest word (byte-wise) within the section, whose size is stored as STRIDE. See the charset section below for a little more clarification.
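Because every entry in a section has the same size, a word can be found by index with simple arithmetic. The sketch below shows the idea. The WordSection struct and the function names are hypothetical, and the exact entry size is an assumption: we take it to be one priority byte, STRIDE word bytes, and one terminating NUL, which the text above does not spell out.

```c
#include <stddef.h>
#include <stdint.h>

/* One row of the "words" array in the index, minus the word length. */
typedef struct
{
  uint32_t stride;   /* STRIDE: longest word in the section, in bytes */
  uint32_t offset;   /* OFFSET: byte offset of the section in the file */
  uint32_t n_words;  /* LENGTH: number of entries in the section */
} WordSection;

/* ASSUMPTION: each entry is 1 priority byte + STRIDE word bytes + 1 NUL. */
static size_t
entry_size (const WordSection *section)
{
  return 1 + (size_t) section->stride + 1;
}

static const unsigned char *
entry_at (const unsigned char *file, const WordSection *section, uint32_t index)
{
  if (index >= section->n_words)
    return NULL;
  return file + section->offset + (size_t) index * entry_size (section);
}

/* Returns the priority (0-255), or -1 if index is out of range. */
static int
section_get_priority (const unsigned char *file, const WordSection *section,
                      uint32_t index)
{
  const unsigned char *entry = entry_at (file, section, index);

  return entry != NULL ? entry[0] : -1;
}

/* Returns the NUL-terminated word, or NULL if index is out of range. */
static const char *
section_get_word (const unsigned char *file, const WordSection *section,
                  uint32_t index)
{
  const unsigned char *entry = entry_at (file, section, index);

  return entry != NULL ? (const char *) (entry + 1) : NULL;
}
```

The returned word points straight into the mapped file, so nothing is copied or allocated on lookup.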

Filter Fragment section

The Filter Fragment section contains two blocks of memory. The first block contains the Letter Lists, which are basically a big block of gushorts. Each list holds the indexes (as in WordIndex) of the words matching one filter fragment, and all of the lists are concatenated together. The block has no internal markers: it’s not possible to tell from within it where one list ends and the next begins, and it’s not possible to calculate that. The lists are in a predetermined order, but there’s nothing special about the order.

The Letter List is also large; we store a gushort per letter per word, which means it is twice as big as the Word List (for English). As an example, for the fragment ??T we store the offset of every word in the three-letter Word List block that matches that pattern. That will be immediately followed by a list of all the words that match ??U, and so forth.

The second part of the Filter Fragment section is the Filter Index for the Letter List block. This is an array of entries, in blocks that grow with word length, used to determine which section of the Letter List contains the list of words for a fragment. For each list we store a guint for its byte offset and a gushort for its length.

The Filter Fragment section is super confusing to describe, so here’s an example. Consider the query "??T":

Start at Length = MIN-LENGTH

   +-+          +-+
   |A|x25 more  |A|x25 more
   +-+          +-+
Length = 2 (Offset = 0)

                             / "??T" IS IN HERE
   +-+          +-+          +-+
   |A|x25 more  |A|x25 more  |A|x25 more
   +-+          +-+          +-+
Length = 3 (Offset = 52 * 6)

   +-+          +-+          +-+          +-+
   |A|x25 more  |A|x25 more  |A|x25 more  |A|x25 more
   +-+          +-+          +-+          +-+

Length = 4 (Offset = 130 * 6)

...

Continues up to Length = MAX-LENGTH

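In the example above, the entry for "??T" is found at offset 738: the Length = 3 block starts at entry 52, the third position adds 2 * 26 entries, and T is at offset 19 in the charset, giving (52 + 52 + 19) * 6. The 6 bytes at that offset contain the guint offset of the fragment’s list within the Letter List, followed by a gushort giving the length of that list.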
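The layout in the diagram can be turned into a small offset calculation. This is a sketch under the assumptions visible in the diagram: 6-byte entries, one entry per position per charset letter, and blocks ordered by increasing word length starting at MIN-LENGTH. The function name is hypothetical, and the result is relative to the start of the Filter Index.

```c
#include <stddef.h>

#define INDEX_ENTRY_SIZE 6   /* guint offset (4 bytes) + gushort length (2 bytes) */

/* Byte offset, relative to the start of the Filter Index, of the entry
 * for one fragment. position is the zero-based position of the known
 * character; char_offset is its offset in the charset ("A" == 0). */
static size_t
fragment_index_offset (int min_length, int charset_len,
                       int length, int position, int char_offset)
{
  size_t entries = 0;

  /* Skip the blocks for all shorter word lengths. */
  for (int len = min_length; len < length; len++)
    entries += (size_t) len * (size_t) charset_len;

  /* Skip earlier positions, then earlier letters, within this length. */
  entries += (size_t) position * (size_t) charset_len + (size_t) char_offset;

  return entries * INDEX_ENTRY_SIZE;
}
```

With min_length = 2 and a 26-letter charset, the first entry for length 3 comes out at 52 * 6 and the first entry for length 4 at 130 * 6, matching the offsets in the diagram.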

Assumptions

We’re storing the length of each filter fragment list as a gushort, so we assume that no list is longer than 65,535 words. So far that’s true with the lists we’re using.

Anagram Hash Table

The Anagram Hash Table is used to store hashes for looking up anagrams. It is structured similarly to the Filter Fragment section: first, there is the Anagram Words block, which has the link between hashes and words. Then, there’s the Anagram Hash Index, which is used to actually look up the hashes and find the index in the Anagram Words section. These are described below:

Anagram Words

This is a block of gushorts, with each being the index of a WordIndex. Each index is included exactly once.

Anagram Hash Index

This section holds a mapping from anagram hash to the list of words with that hash. It is structured as an array, with each block being 9 bytes long. The block contains three components:

OFFSET: The offset from the base of the Anagram Words block
HASH: The hash of the sorted characters of each word
LEN: How long each hash list will be. For most words without a matching anagram word, the length of this is 1.

This list is sorted by hash value, and the expectation is that one can do a binary search through the array to find the hash.

+--------------+----------+------+--
| guint        | guint    |guchar| (REPEATED)
+--------------+----------+------+--
  OFFSET         HASH      LEN

NOTE: there’s room for optimizing this structure significantly if file size is a problem. In addition, the 9-byte block leaves us unaligned in the data structure.
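A binary search over the packed 9-byte blocks might look like the sketch below. The AnagramEntry struct and function names are hypothetical. Fields are copied out with memcpy() because the 9-byte layout leaves them unaligned, and we assume the file is written in the host’s byte order with 4-byte guints.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_ENTRY_SIZE 9

typedef struct
{
  uint32_t offset;  /* OFFSET from the base of the Anagram Words block */
  uint32_t hash;    /* HASH of the sorted letters */
  uint8_t  len;     /* LEN: number of words sharing the hash */
} AnagramEntry;

/* Decode block i. memcpy avoids unaligned loads; host byte order assumed. */
static AnagramEntry
read_entry (const unsigned char *index, size_t i)
{
  const unsigned char *p = index + i * HASH_ENTRY_SIZE;
  AnagramEntry entry;

  memcpy (&entry.offset, p, 4);
  memcpy (&entry.hash, p + 4, 4);
  entry.len = p[8];
  return entry;
}

/* Binary search the hash-sorted index. Returns true and fills result
 * when the hash is present. */
static bool
find_hash (const unsigned char *index, size_t n_entries,
           uint32_t hash, AnagramEntry *result)
{
  size_t lo = 0, hi = n_entries;

  while (lo < hi)
    {
      size_t mid = lo + (hi - lo) / 2;
      AnagramEntry entry = read_entry (index, mid);

      if (entry.hash == hash)
        {
          *result = entry;
          return true;
        }
      if (entry.hash < hash)
        lo = mid + 1;
      else
        hi = mid;
    }
  return false;
}
```

A hit gives the OFFSET and LEN needed to read the matching indexes out of the Anagram Words block; each of those words still has to be checked against the filter because of possible hash collisions.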

Index Section

The Index Section is a JSON block located at the end of the file. It’s located by seeking to the end of the file and scanning backwards until a ‘\0’ character is reached. Despite being at the end of the file, it’s read first as it has the keys to understand the data within this file.

Data we store in the index:

Valid character list found in the puzzle. This can be used for building a filter, and is stored as an array of unichars
Meta information about the data that’s stored
Locations for each word list. Example, the location for all 2 letter words, all 3 letter words, etc.
Location of the filter fragment and anagram hash sections

Example Index

\0{
  "charset": "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
  "filterchar": "?",
  "min-length": 2,
  "max-length": 21,
  "threshold": 50,
  "letter-list-offset": OFFSET,
  "letter-index-offset": OFFSET,
  "anagram-word-list-offset": OFFSET,
  "anagram-hash-index-offset": OFFSET,
  "anagram-hash-index-length": LENGTH,
  "words": [[2, STRIDE, OFFSET, LENGTH],
            [3, STRIDE, OFFSET, LENGTH],
            [4, STRIDE, OFFSET, LENGTH],
            ...
            [21, STRIDE, OFFSET, LENGTH]]
}

Note the existence of the leading \0 character to indicate the start of the section.

CHARSET is a Unicode string, UTF-8 encoded.
STRIDE is the maximum width (in bytes) of each word in the table (see note below).
OFFSET is an unsigned int with the number of bytes into the file the table starts.
LENGTH is the number of entries in the table.
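Locating the index by scanning backwards for the leading \0 can be sketched like this. The function name is hypothetical, and we assume that the JSON text runs to the very last byte of the file with no trailing NUL or padding after it.

```c
#include <stddef.h>

/* Return a pointer to the first byte of the JSON index (the byte after
 * the last '\0' in the file), or NULL if the file contains no '\0'.
 * On success *json_len is the number of JSON bytes. The JSON is not
 * NUL-terminated, so it must be parsed with an explicit length. */
static const char *
find_index_section (const char *data, size_t size, size_t *json_len)
{
  for (size_t i = size; i > 0; i--)
    {
      if (data[i - 1] == '\0')
        {
          *json_len = size - i;
          return data + i;
        }
    }
  return NULL;
}
```

The returned pointer and length can then be handed to a JSON parser that accepts a length-delimited buffer.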

Internal representation

It’s really easy to get confused among all the internal tables. As a result, we use the following structs as our primary references when we look things up.

typedef struct
{
  gint length;
  gint index;
} WordIndex;

typedef struct
{
  gint length;      /* Length of the word */
  gint position;    /* Position of the character within the word */
  gint char_offset; /* location within the charset of the word */
} FilterFragment;

The WordIndex gives us a reference to a word (and its priority). We can look up a word with just these two fields. It is public and reversible: we can look up a word by WordIndex, as well as find the WordIndex of a given word (if it exists).

The FilterFragment gives us an individual part of a filter, with only one character selected. So “???G????” is a filter fragment, and “???G??N?” is made out of two fragments combined. We can look up a list of WordIndexes for a given Filter Fragment.
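Splitting a filter into its fragments can be sketched as below. The struct mirrors FilterFragment above with plain int in place of gint so that the example is self-contained without GLib. The function name is hypothetical, and the sketch only handles single-byte (ASCII) charsets with ? as the filter character.

```c
#include <stddef.h>
#include <string.h>

typedef struct
{
  int length;      /* Length of the word */
  int position;    /* Position of the character within the word */
  int char_offset; /* Offset of the character within the charset */
} FilterFragment;

/* Split a filter such as "??MO??T" into one fragment per known letter.
 * ASSUMPTION: one byte per character, so ASCII charsets only.
 * Returns the number of fragments written, or -1 if a letter is not
 * in the charset or out is too small. */
static int
filter_to_fragments (const char     *filter,
                     const char     *charset,
                     FilterFragment *out,
                     size_t          max_out)
{
  int length = (int) strlen (filter);
  int n = 0;

  for (int i = 0; i < length; i++)
    {
      const char *found;

      if (filter[i] == '?')
        continue;
      found = strchr (charset, filter[i]);
      if (found == NULL || (size_t) n >= max_out)
        return -1;
      out[n].length = length;
      out[n].position = i;
      out[n].char_offset = (int) (found - charset);
      n++;
    }
  return n;
}
```

A return value of 0 corresponds to a filter made only of ?s, and a return value of 1 to a filter that is already a single fragment.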

Charsets and Internationalization

We have made every effort to make this UTF-8 clean / Unicode-aware. We normalize every word using Normalization Form C and store it in uppercase. For English, this has the practical result that every character is an uppercase ASCII letter.

The charset is important for calculating the filter table. We index characters based on their offset within the charset. For the default English charset it’s “A” == 0, “B” == 1, etc. For other languages, accents and other marks are considered independent characters. So the charset for Spanish would have both N and Ñ in it, as well as accented versions of all the letters.

The difference between word length and stride is not always clear. An example is the French word PÂTÉ. It has a length of four, but takes six bytes in UTF-8, so the STRIDE of its section is at least six. The Â with the circumflex would be considered an independent character, and not be stored along with an unaccented A.

A note on the words being used:

I’m using Peter Broda’s word list for English crosswords. It’s long and has plenty of dubious words, but it’s a good starting point. We will pre-generate the mmapped file and update it before each release. This word list seems to be actively maintained and updates regularly.

This file is mostly clean. It’s plain ASCII with \r\n line breaks. A few entries contain non-letters, and we just discard those (Example: MIA!;50 has an exclamation point in it). There’s no explicit license in the file but permission is granted on the website:

You are free to use this list in any way you’d like. This includes commercial uses, though I’d appreciate it if you didn’t just turn around and try to sell it (but I mean, I’ll still offer it for free to anyone so that wouldn’t be a smart business venture anyway).

Alternate word lists

The wordnik wordlist is also available. It has no phrases and fewer words than the Peter Broda list, but that might be better for your puzzles. It can be used instead at compile time by enabling it here and commenting out the line above it.