Word List

Status: Final | Implemented | March 2024

Updated: November 2024

The word list provides an efficient way to store and search for words for the editor. It is not intended to provide fuzzy-, textual-, or regexp-style matching; instead, it focuses on pure letter matching.

Requirements

We need to support the following use cases:

  • Three modes of operation:

    • MATCH: Match by letters. For example, "?OR??" will match WORDS and CORES, among others.

    • ANAGRAM: Find anagrams of a set of letters. So CAT will return ACT, CAT, TAC, and TCA.

    • INTERSECTION: Find the words that fit when two filters cross each other. This is used for the EditWordList.

  • Efficiency and speed. We want to be as fast as possible when searching for words.

  • Multiple concurrent instances of the word list. We have many Word List instances in the app, and use them from separate threads.

  • Internationalized word lists. It should be UTF-8 clean.

  • Hot swappable data sources. We want to be able to switch to different lists of words.

  • Letter Frequencies of the words. This is needed for the solver.

We also need this to be as fast as possible. In the Crossword Editor, we use matches while building puzzles to find possible solutions for sections of the puzzle, or to provide hints for clues. Currently, all lookups are synchronous, which puts an upper bound on the time of each operation. We are currently able to do a lookup in less than 10 microseconds.

We never need to return a list of words of mixed word LENGTH; results are always grouped by word length. E.g., we never have any list with both CAT and MOUSE in it. Also, note that when we talk about length, we mean the number of code points in a given word, and not the number of bytes. So AÑO has a length of three, despite being four bytes long when encoded in UTF-8.

Overall approach

We store the word data in an mmap()-able file. This file is generated at compile time, and loaded as a GResource. The file consists of five distinct blocks of data, each described in greater detail below:

  1. Word List – A list of all the words in the data set. These words are bucketed by word length in different subsections. Within each bucket they are stored first by priority, and then alphabetically.

  2. Filter Fragment List – A list of words, indexed by letter, sorted by word length. This is used to MATCH filters. For example: The three letter word subsection would have all the words that match "A??" followed by all the words matching "B??", all the way up through all the words matching "??Z".

  3. Hash List – A sorted table of hashes, one per word. This is used by the anagram mode.

  4. Letter Frequency List – A precomputed list of letter frequencies in the word list. Used for autosolve.

  5. Index Section – The table of contents for the file. It has byte offsets for each section and subsection of the file. It’s located at the end of the file, and is stored as a JSON block.

Everything in the list will be indexed by a byte offset into the data file. Some of the byte-offsets are absolute; some are relative. We mmap the file and seek through it to find what we need.
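Note that GResource data is already memory-mapped by GLib, so obtaining the base pointer does not copy the file. A minimal sketch of what loading might look like (the resource path here is hypothetical):

#include <gio/gio.h>

/* Look up the word-list data compiled into the GResource bundle.
 * All offsets from the index section are interpreted relative to
 * the returned base pointer. */
static gconstpointer
load_word_list_data (gsize *size_out)
{
  GError *error = NULL;
  GBytes *bytes;

  bytes = g_resources_lookup_data ("/org/gnome/Crosswords/word-list.bin",
                                   G_RESOURCE_LOOKUP_FLAGS_NONE, &error);
  if (bytes == NULL)
    {
      g_warning ("Failed to load word list: %s", error->message);
      g_error_free (error);
      return NULL;
    }

  /* Keep the GBytes alive for the lifetime of the process; the
   * returned pointer is only valid while it exists. */
  return g_bytes_get_data (bytes, size_out);
}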

MATCH mode approach

When in MATCH mode, we decompose the given filter into different filter fragments. A filter fragment is a filter pattern with exactly one known character. We precalculate the list of words associated with each fragment in the Filter Fragment List.

So, for example, consider the filter ??MO??T. We break it up into three filter fragments: ??M????, ???O???, and ??????T. We then build the result out of the intersection of each fragment’s word list. This intersection is quick to compute: the lists are sorted, so we only have to take one pass through each.
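Because each fragment’s word list is sorted, the intersection is a single merge-style pass. A sketch of that inner loop (the names and signature are illustrative, not the actual implementation):

#include <glib.h>

/* Intersect two sorted arrays of word indexes in one pass.
 * Returns the number of indexes written to `out`, which must hold
 * at least MIN (len_a, len_b) entries. */
static gsize
intersect_sorted (const gushort *a, gsize len_a,
                  const gushort *b, gsize len_b,
                  gushort       *out)
{
  gsize i = 0, j = 0, n = 0;

  while (i < len_a && j < len_b)
    {
      if (a[i] < b[j])
        i++;
      else if (a[i] > b[j])
        j++;
      else
        {
          out[n++] = a[i];
          i++;
          j++;
        }
    }

  return n;
}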

As special cases, we can bypass the intersection when the filter has zero or one known character. When the filter only has ?s, we can just use the Word List bucket for that word length. When we are looking for a filter pattern with one known character (such as ??M????), the fragment’s precomputed list in the Filter Fragment section can be used directly.

ANAGRAM mode approach

In order to find all the anagrams of the filter, we first sort each word’s letters and hash the result. For each hash, we store a list of the words that share it. That gives us a quick and efficient way of finding all anagrams of a word. There are hash collisions in our word list, but not very many. That means that when we look up the words for a hash, we need to double-check that the results are anagrams.
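A sketch of how such a hash key can be computed: sort the word’s characters, then hash the sorted string. (The actual sort order and hash function used by the file format are not specified here; g_str_hash is just a stand-in, and a single-byte charset is assumed for brevity.)

#include <glib.h>
#include <stdlib.h>
#include <string.h>

static gint
compare_chars (gconstpointer a, gconstpointer b)
{
  return *(const gchar *) a - *(const gchar *) b;
}

/* Two words are anagrams iff their sorted letter strings are equal,
 * so hashing the sorted string buckets all anagrams together. */
static guint
anagram_hash (const gchar *word)
{
  g_autofree gchar *sorted = g_strdup (word);

  qsort (sorted, strlen (sorted), 1, compare_chars);

  return g_str_hash (sorted);
}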

NOTE: One tradeoff of this approach versus a traditional trie-based approach is that we can’t easily handle anagrams with unknown letters. That means we can’t easily generate a list of words whose anagrams match CDO?.

INTERSECTION mode approach

The intersection mode is significantly more complex than either of the other two modes. It takes in two filters, along with a position within each filter that indicates where they cross. Consider the following grid fragment:

   #  E
      ?
   D  ?  ?  ?
      ?
      Y

We would pass in E???Y with a position of 2 for the first filter, and D??? with a position of 1 for the second. This operation will calculate the possible characters that can go in the crossing cell (the ? where the two words cross in the diagram above), as well as the two lists of possible words for those two filters.

In this example, ‘J’ is a possible crossing character (for ENJOY and DJIN), but ‘V’ is not.

API

WordList is a GObject that implements a GListModel-like API. It has a stateful “filter” property that determines what words it makes available. The filter defaults to "" – the empty word list.

The interpretation of the filter depends on the mode of the word list. For example, question marks are not parsed as unknown letters in ANAGRAM mode.

typedef enum
{
  WORD_LIST_NONE,
  WORD_LIST_MATCH,
  WORD_LIST_ANAGRAM,
} WordListMode;

WordList    *word_list_new                  (void);
void         word_list_set_filter           (WordList      *word_list,
                                             const gchar   *filter,
                                             WordListMode   mode);
guint        word_list_get_n_items          (WordList      *word_list);
const gchar *word_list_get_word             (WordList      *word_list,
                                             guint          position);
gint         word_list_get_priority         (WordList      *word_list,
                                             guint          position);

gfloat       word_list_get_char_frequency   (WordList      *word_list,
                                             guint          word_len,
                                             guint          letter_position,
                                             gunichar       c);
void         word_list_find_intersection    (WordList      *word_list,
                                             const gchar   *filter1,
                                             const guint    pos1,
                                             const gchar   *filter2,
                                             const guint    pos2,
                                             IpuzCharset  **intersecting_chars,
                                             WordArray    **filter1_words,
                                             WordArray    **filter2_words);

The priority of each word is a value between 0 and 255, and defaults to 50.

Like the GListModel interface, this lets the user get the number of words (per filter) and get the word/priority for each position. However, unlike GListModel, it doesn’t emit signals and doesn’t return GObjects, for performance reasons. Wrapping it is a relatively simple task, and WordListModel is a wrapper around WordList that provides this interface.

Note that WordList gives completely stable results. It will always return the same answers for a given filter. This means that you can set the filter to be some value and iterate through the items, change the filter, and then set it back and continue your iteration.
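A short usage sketch of the API above (the filters are taken from the examples earlier; the printed values are illustrative):

WordList *word_list = word_list_new ();
guint n_items;

/* Find all five-letter words with O and R in the middle. */
word_list_set_filter (word_list, "?OR??", WORD_LIST_MATCH);

n_items = word_list_get_n_items (word_list);
for (guint i = 0; i < n_items; i++)
  {
    const gchar *word = word_list_get_word (word_list, i);
    gint priority = word_list_get_priority (word_list, i);

    g_print ("%s (%d)\n", word, priority);
  }

/* Switch to anagram mode; the same object is reused. */
word_list_set_filter (word_list, "CAT", WORD_LIST_ANAGRAM);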

Data Sections

As mentioned, the overall resource file is divided into four sections of data:

  • Word List

  • Filter Fragments

    • Letter List

    • Filter Index

  • Anagram Hash Table

    • Anagram Words

    • Anagram Hash Index

  • Letter Frequency Table

In addition, there’s an index section that’s included with the GResource.

Each is described in detail below.

NOTE: Each section is defined to be stored in little-endian byte order.

Word List Sections

This block stores all the words along with their priority. The block is divided into multiple Word List Sections – one for each word length. So, for example, there’s a section for all the words that are three characters long, followed by a section for all the words that are four characters long, etc.

Each word entry in a section consists of a 1-byte priority stored as an unsigned char, followed by a UTF-8 string terminated by the null character. Each entry is padded out with ‘\0’ characters to fit the byte-wise longest word within the section; that padded width is stored as STRIDE. See the charset section below for a little more clarification.
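With STRIDE known from the index section, locating a word is simple pointer arithmetic. A minimal sketch, assuming each entry occupies 1 + STRIDE + 1 bytes (the priority byte, the word padded to STRIDE bytes, and a terminating NUL); the exact packing is defined by the file generator:

#include <glib.h>

/* Fetch the word and priority at `index` within one word-length
 * section of the mmapped file. `base` points at the start of that
 * section, as recorded in the index section. */
static const gchar *
get_word_entry (gconstpointer base,
                gsize         stride,
                guint         index,
                guchar       *priority_out)
{
  const guchar *entry = (const guchar *) base
                      + (gsize) index * (1 + stride + 1);

  if (priority_out != NULL)
    *priority_out = entry[0];       /* 1-byte priority */

  return (const gchar *) entry + 1; /* NUL-terminated UTF-8 word */
}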

Filter Fragment section

The Filter Fragment section contains two blocks of memory. The first block contains the Letter Lists, which are basically one big block of gushorts: the lists of word indexes (the index field of a WordIndex) for every filter fragment, concatenated together. The block carries no internal structure of its own; the lists are in a predetermined order, but there’s nothing special about that order, and the Filter Index (below) is needed to locate each list.

The Letter List is also large; we store a gushort per letter per word, which makes it roughly twice as big as the Word List (for English). As an example, for the fragment ??T we store the index of every word in the three-letter Word List block that matches that pattern. That is immediately followed by a list of all the words that match ??U, and so forth.

The second part of the Filter Fragment section is the Filter Index for the Letter List block. This is a geometrically increasing set of offsets that determines which section of the Letter List contains a given fragment’s list of words. We store a guint for the byte offset of each list, and a gushort for the length of each list.

The Filter Fragment section is super confusing to describe, so here’s an example. Consider the query "??T":

Start at Length = MIN-LENGTH

   +-+          +-+
   |A|x25 more  |A|x25 more
   +-+          +-+
Length = 2 (Offset = 0)

                             / "??T" IS IN HERE
   +-+          +-+          +-+
   |A|x25 more  |A|x25 more  |A|x25 more
   +-+          +-+          +-+
Length = 3 (Offset = 52 * 6)

   +-+          +-+          +-+          +-+
   |A|x25 more  |A|x25 more  |A|x25 more  |A|x25 more
   +-+          +-+          +-+          +-+

Length = 4 (Offset = 130 * 6)

...

Continues up to Length = MAX-LENGTH

In the example above, the entry for the query "??T" is found at offset 738: 52 entries for the two-letter words, plus 52 entries for the first two positions of the three-letter words, plus 19 entries for the letters A through S in the third position, at 6 bytes each ((52 + 52 + 19) * 6 = 738). The 6 bytes at that offset contain a guint byte offset into the Letter List for the fragment’s word list, and a gushort giving its length.
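This offset arithmetic can be written down directly. A small sketch (names and signature are illustrative, not the actual implementation), where n_letters is the charset size (26 for English) and each index entry is 6 bytes:

#include <glib.h>

#define INDEX_ENTRY_SIZE 6  /* guint byte offset + gushort length */

/* Compute the byte offset of a fragment's entry in the Filter Index.
 * Each word length L contributes n_letters * L entries, which is why
 * the blocks grow geometrically with word length. */
static gsize
filter_index_offset (guint min_length,
                     guint word_length,
                     guint position,     /* of the known letter */
                     guint char_offset,  /* of the letter in the charset */
                     guint n_letters)
{
  gsize entries = 0;

  /* Skip the blocks for all shorter word lengths. */
  for (guint len = min_length; len < word_length; len++)
    entries += (gsize) n_letters * len;

  /* Skip earlier positions within this block, then earlier letters. */
  entries += (gsize) position * n_letters + char_offset;

  return entries * INDEX_ENTRY_SIZE;
}

/* filter_index_offset (2, 3, 2, 19, 26) == 738, matching "??T" above. */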

Assumptions

We’re storing the length of each filter fragment list in the word list as a gushort, so we assume that no fragment matches more than 65,535 words. So far that’s true with the lists we’re using.

Anagram Hash Table

The Anagram Hash Table is used to store hashes for looking up anagrams. It is structured similarly to the Filter Fragment section: first, there is the Anagram Words block, which links hashes to words. Then there’s the Anagram Hash Index, which is used to actually look up a hash and find the corresponding slice of the Anagram Words section. These are described below:

Anagram Words

This is a block of gushorts, each being the index field of a WordIndex. Each word’s index is included exactly once.

Anagram Hash Index

This section maps an anagram hash to its list of words. It is structured as an array, with each record being 9 bytes long. Each record contains three components:

  • OFFSET: The offset from the base of the Anagram Words block

  • HASH: The hash of the sorted characters of each word

  • LEN: How long the hash’s word list is. For most words, which have no matching anagram, the length is 1.

This list is sorted by hash value, and the expectation is that one can do a binary search through the array to find the hash.

+--------------+----------+--------+---
| guint        | guint    | guchar |  (REPEATED)
+--------------+----------+--------+---
  OFFSET         HASH       LEN

NOTE: There’s room to optimize this structure significantly if file size becomes a problem. In addition, the 9-byte record leaves the data structure unaligned. Some quick tests show that alignment doesn’t really matter on current Intel processors.
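A sketch of the binary search over the packed 9-byte records (little-endian, per the note above; names are illustrative):

#include <glib.h>
#include <string.h>

#define ANAGRAM_RECORD_SIZE 9  /* guint offset + guint hash + guchar len */

/* Binary-search the Anagram Hash Index for `hash`. `index` points at
 * the first record; `n_records` comes from "anagram-hash-index-length"
 * in the index section. On success, fills in the record's fields. */
static gboolean
anagram_index_lookup (gconstpointer index,
                      gsize         n_records,
                      guint32       hash,
                      guint32      *offset_out,
                      guint8       *len_out)
{
  gsize lo = 0, hi = n_records;

  while (lo < hi)
    {
      gsize mid = lo + (hi - lo) / 2;
      const guchar *rec = (const guchar *) index + mid * ANAGRAM_RECORD_SIZE;
      guint32 rec_hash;

      /* The records are unaligned, so read the fields bytewise. */
      memcpy (&rec_hash, rec + 4, sizeof rec_hash);
      rec_hash = GUINT32_FROM_LE (rec_hash);

      if (rec_hash < hash)
        lo = mid + 1;
      else if (rec_hash > hash)
        hi = mid;
      else
        {
          guint32 offset;

          memcpy (&offset, rec, sizeof offset);
          *offset_out = GUINT32_FROM_LE (offset);
          *len_out = rec[8];
          return TRUE;
        }
    }

  return FALSE;
}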

Letter Frequency Section

The letter frequency section stores the frequency of each letter at each position, for each word length. It uses the same geometrically increasing layout as the filter fragment section, so each letter/position location is predictable. The frequencies of all letters at a given position sum to 1.0.

This frequency is used by the word solver to pick the best next letter to try, and is fairly niche.
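For example, the solver can rank candidate letters for an empty cell using the API above (a sketch; `charset` and `n_chars` are assumed to come from the index section):

/* Pick the statistically most likely letter for a cell, given the
 * word length and the cell's position within the word. */
static gunichar
best_letter (WordList       *word_list,
             guint           word_len,
             guint           letter_position,
             const gunichar *charset,
             guint           n_chars)
{
  gunichar best = 0;
  gfloat best_freq = -1.0f;

  for (guint i = 0; i < n_chars; i++)
    {
      gfloat freq = word_list_get_char_frequency (word_list, word_len,
                                                  letter_position,
                                                  charset[i]);
      if (freq > best_freq)
        {
          best_freq = freq;
          best = charset[i];
        }
    }

  return best;
}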

Index Section

NOTE: This section is currently included as a separate file within the GResource.

The Index Section is a separate file in the GResource. Data we store in the index:

  • Valid character list found in the puzzle. This can be used for building a filter, and is stored as an array of unichars

  • Meta information about the data that’s stored

  • Locations for each word list section. For example, the location of all 2-letter words, all 3-letter words, etc.

  • Location of the filter fragment and anagram hash sections

Example Index

{
  "display-name": "Peter Broda's Wordlist",
  "charset": "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
  "filterchar": "?",
  "min-length": 2,
  "max-length": 21,
  "threshold": 50,
  "visibility": "editor",
  "letter-list-offset": OFFSET,
  "letter-index-offset": OFFSET,
  "anagram-word-list-offset": OFFSET,
  "anagram-hash-index-offset": OFFSET,
  "anagram-hash-index-length": LENGTH,
  "frequency-table-offset": OFFSET,
  "words": [[2, STRIDE, OFFSET, LENGTH],
            [3, STRIDE, OFFSET, LENGTH],
            [4, STRIDE, OFFSET, LENGTH],
            ...
            [21, STRIDE, OFFSET, LENGTH]]
}
  • CHARSET is a unicode string, UTF-8 encoded.

  • STRIDE is the maximum width (in bytes) of each word in the table (see note below).

  • OFFSET is an unsigned int with the number of bytes into the file the table starts.

  • LENGTH is the number of entries in the table.

  • VISIBILITY is a string. It can be editor, player, or test.

Internal representation

It’s really easy to get confused among all the internal tables. As a result, we use the following structs as our primary reference when looking things up.

typedef struct
{
  gint length;
  gint index;
} WordIndex;

typedef struct
{
  gint length;      /* Length of the word */
  gint position;    /* Position of the character within the word */
  gint char_offset; /* Offset of the character within the charset */
} FilterFragment;

The WordIndex gives us a reference to a word (and its priority). We can look up a word with just these two fields. It is public and works in both directions: we can look up a word by WordIndex, as well as find the WordIndex of a given word (if it exists).

The FilterFragment gives us an individual part of a filter, with only one character selected. So “???G????” is a filter fragment, and “???G??N?” is made out of two fragments combined. We can look up a list of WordIndexes for a given Filter Fragment.
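A sketch of the decomposition from filter string to fragments (char_to_offset() is a hypothetical helper mapping a character to its position in the charset; a single-byte charset is assumed for brevity, and the real code depends on the charset handling described below):

#include <glib.h>
#include <string.h>

/* Decompose a filter such as "???G??N?" into one FilterFragment per
 * known character. Returns a newly allocated array; its length is
 * stored in n_fragments_out. */
static FilterFragment *
filter_to_fragments (const gchar *filter,
                     guint       *n_fragments_out)
{
  guint length = strlen (filter);
  GArray *fragments = g_array_new (FALSE, FALSE, sizeof (FilterFragment));

  for (guint pos = 0; pos < length; pos++)
    {
      FilterFragment frag;

      if (filter[pos] == '?')
        continue; /* Unknown letters contribute no fragment. */

      frag.length = length;
      frag.position = pos;
      frag.char_offset = char_to_offset (filter[pos]);
      g_array_append_val (fragments, frag);
    }

  *n_fragments_out = fragments->len;

  return (FilterFragment *) g_array_free (fragments, FALSE);
}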

Charsets and Internationalization

We have made every effort to make this UTF-8 clean / Unicode-aware. We normalize every word using Normalization Form C. For English, this has the practical result that every character is uppercase.

The charset is important for calculating the filter table. We index characters based on their offset within the charset. For the default English charset, ‘A’ == 0, ‘B’ == 1, etc. For other languages, accents and other marks are considered independent characters. So the charset for Spanish would have both N and Ñ in it, as well as accented versions of all the letters.

The difference between word length and stride is not always clear. An example is the French word PÂTÉ. It has a LENGTH of 4 and a STRIDE of 6. The Â with the circumflex is considered an independent character, and is not stored along with the unaccented A.
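The distinction is the one between g_utf8_strlen() and strlen(). A quick illustration:

const gchar *word = "PÂTÉ";              /* NFC-normalized UTF-8 */

gsize length = g_utf8_strlen (word, -1); /* 4: code points */
gsize stride = strlen (word);            /* 6: bytes (Â and É take 2 bytes each) */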

Wordlist sources

There are two word lists we use: the Peter Broda word list and the Wordnik word list.

Peter Broda’s word list

This word list is long and has plenty of dubious words, but it’s a good starting point. It seems to have been actively maintained and updated regularly, though it had an outage in early 2024. We have a relatively recent copy of it. Its main advantage is that it provides enumerations and some scoring. The scoring is dubious, though; I wouldn’t score the words the way this list has done. It also has a high count of profane words with a high score.

This file is mostly clean. It’s plain ASCII with \r\n line breaks and mixed case. There are a few entries with non-letters in them, and we just discard those (example: MIA!;50 has an exclamation point in it). There’s no explicit license in the file, but permission is granted on the website:

You are free to use this list in any way you’d like. This includes commercial uses, though I’d appreciate it if you didn’t just turn around and try to sell it (but I mean, I’ll still offer it for free to anyone so that wouldn’t be a smart business venture anyway).

Wordnik word list

The Wordnik word list is also included. It has no phrases and fewer words than the Peter Broda list, but that might be better for some puzzles. It’s effectively the union of the US and UK Scrabble dictionaries, plus some extra words. It’s stored as a list of lowercase words in quotes, with a newline between them.

It is available under an MIT license.

Areas for improvement

  • When doing the intersection of the filter fragments, we go through them sequentially in the order of the filter. So, if we have a filter of “S???AZ”, we start with the “S?????” filter and then intersect it with “????A?”. Both lists are quite long. We then intersect with “?????Z”, which has far fewer words (~20).

    We should sort the fragments from shortest list to longest. We are likely to fail faster in the case when there are no matches, and may have a very short list to intersect against (which could end early).

    We could also experiment with doing a binary search to find sections to skip.

  • The ANAGRAM and MATCH modes both put the WordList in a mode where you set the filter, then query the object. On the other hand, INTERSECTION returns a pair of WordArrays that operate independently of the WordList’s internal state. This should be reconciled.

  • Change all the offsets to be relative in the code. That would let us install each section in its own file in the future.

  • Fix the endianness when parsing. The data is now encoded as little-endian and stored in git, so we will have to byte-swap on big-endian machines. Currently, we disable s390 support, as I haven’t written that yet.