# Word List

**Status**: MATCH and ANAGRAM completed

The word list provides an efficient way to store and search for words for the editor.

## Requirements

We need to support the following use cases:

* We support two filter modes:
  * **MATCH:** Filter by letters. For example, `"?OR??"` will match WORDS and CORES, among others.
  * **ANAGRAM:** Find anagrams out of a set of letters. So `CAT` will return `ACT`, `CAT`, `TAC`, and `TCA`.
* Efficient lookup of words when filling in a crossword. We want to avoid linear searches through the list of words.
* Multiple concurrent instances of the word list.
* Internationalized word lists.

We also need this to be as fast as possible. In the Crossword Editor, we use matches to efficiently build puzzles recursively in order to find possible solutions for sections of the puzzle.

We never need a list of words of mixed word *LENGTH*. Lists are _always_ segmented by word length; e.g., we never have any list with both CAT and MOUSE in it. Also, note that when we talk about length, we mean the number of code points in a given word, and not the number of bytes. So `AÑO` has a length of three, despite being four bytes long (encoded in UTF-8).

## Overall approach

We store the word data in an mmap()-able file. This file is generated at compile time, and loaded as a `GResource`. The file consists of four distinct blocks of data, each described in greater detail below:

1. **Word List** – A list of all the words in the data set. These words are bucketed by word length into different subsections. Within each bucket they are sorted first by priority, and then alphabetically.
1. **Filter Fragment List** – A list of words, indexed by letter, sorted by word length. This is used for MATCH filters. For example: the three-letter-word subsection would have all the words that match `"A??"`, followed by all the words matching `"B??"`, all the way up through all the words matching `"??Z"`.
1. **Hash List** – A sorted table of hashes of each word's sorted letters. This is used by the anagram mode.
1. **Index Section** – The table of contents for the file. It has byte offsets for each section and subsection of the file. It's located at the end of the file, and is stored as a JSON block.

Everything in the list is indexed by a byte offset into the data file. Some of the byte offsets are absolute; some are relative. We mmap the file and seek through it to find what we need.

### MATCH mode approach

When in MATCH mode, we decompose the given filter into different filter fragments. A filter fragment is a filter pattern with only one character specified, and is precalculated in the **Filter Fragment List**. So, for example, consider the filter `??MO??T`. We break it up into three filter fragments: `??M????`, `???O???`, and `??????T`. We build a list out of the intersection of each filter fragment's words. This intersection is quick to compute: the lists are sorted, so we only have to take one pass through each fragment's list.

As special cases, we can bypass the intersection when our filter has 0 or 1 characters. When the filter only has `?`s, we can just use the _Word List_ bucket for that word length. When we are looking for a filter pattern with one character selected (such as `??M????`), the _Letter Index_ can be used directly.

### ANAGRAM mode approach

In order to find all the anagrams of the filter, we first sort each word's letters and hash the result. For each hash, we store a list of the words that share that hash.
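As a concrete illustration, here is a minimal sketch of such a hash in C with GLib (the toolkit the project already uses). `anagram_hash` and `compare_chars` are hypothetical names, and this version sorts raw bytes, so it is only correct for ASCII words; the real list operates on normalized UTF-8 code points, and the actual hash function may differ:

```c
#include <glib.h>
#include <stdlib.h>
#include <string.h>

static int
compare_chars (const void *a, const void *b)
{
  return *(const char *) a - *(const char *) b;
}

/* Hypothetical helper: hash a word by sorting its letters first, so
 * that all anagrams of a word produce the same hash value. */
static guint
anagram_hash (const gchar *word)
{
  gchar *sorted = g_strdup (word);
  guint hash;

  qsort (sorted, strlen (sorted), 1, compare_chars);
  hash = g_str_hash (sorted);
  g_free (sorted);

  return hash;
}
```

With such a scheme, `anagram_hash ("CAT")`, `anagram_hash ("ACT")`, and `anagram_hash ("TCA")` all return the same value, so all anagrams of a word land in the same bucket.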
That gives us an efficient way of quickly finding all anagrams for a word. There are hash collisions in our word list, but not very many. That means that when we look up the words for a hash, we need to double-check that the results are anagrams.

**NOTE:** One tradeoff of this approach versus a traditional trie-based approach is that we can't easily handle anagrams with unknown letters. That means that we can't easily generate a list of words whose anagrams match `CDO?`.

## API

WordList is a `GObject` that implements a `GListModel`-like API. It has a stateful "filter" property that determines what words it makes available. The filter defaults to `""` – the empty word list. The meaning of the filter depends on the mode of the word list. Question marks are not parsed as an unknown letter in ANAGRAM mode.

```c
typedef enum
{
  WORD_LIST_NONE,
  WORD_LIST_MATCH,
  WORD_LIST_ANAGRAM,
} WordListMode;

WordList     *word_list_new          (void);
void          word_list_set_filter   (WordList     *word_list,
                                      const gchar  *filter,
                                      WordListMode  mode);
guint         word_list_get_n_items  (WordList     *word_list);
const gchar  *word_list_get_word     (WordList     *word_list,
                                      guint         position);
gint          word_list_get_priority (WordList     *word_list,
                                      guint         position);
```

The priority of each word is a value between 0 and 255, and defaults to 50.

Like the `GListModel` interface, this lets the user get the number of words (per filter) and get the word/priority for each position. However, unlike `GListModel`, it doesn't emit signals and doesn't return GObjects, for performance reasons. Providing a real `GListModel` is a relatively simple task, and `WordListModel` is a wrapper around `WordList` that provides this interface.

Note that `WordList` is completely stable. It will always return the same answers for a given filter. This means that you can set the filter to some value and iterate through the items, change the filter, and then set it back and continue your iteration.

## Data Sections

As mentioned, the overall resource file is divided into four sections of data:

* **Word List**
* **Filter Fragments**
  * **Letter List**
  * **Filter Index**
* **Anagram Hash Table**
  * **Anagram Words**
  * **Anagram Hash Index**
* **Index Section**

Each is described in detail below.

### Word List Sections

This block stores all the words along with their priority. The block is divided into multiple _Word List Sections_ – one for each word length. So, for example, there's a section for all the words that are three characters long, followed by a section for all the words that are four characters long, etc.

Each word entry in a section consists of a 1-byte priority stored as an unsigned char, followed by a UTF-8 string terminated by the null character. Each entry is padded out with '\0' characters to fit the longest (byte-wise) word within the section; this padded width is stored as STRIDE. See the charset section below for a little more clarification.

### Filter Fragment Section

The _Filter Fragment_ section contains two blocks of memory. The first block contains the _Letter Lists_, which are basically a big block of gushorts. This is a list of the `WordIndex` indexes for each filter fragment, concatenated together. It's not possible to find out any state from within the block, and it's not possible to calculate anything about it. The lists are in a predetermined order, but there's nothing special about the order. The _Letter List_ is also large; we store a `gushort` per letter per word, which means it is twice as big as the _Word List_ (for English).

As an example, for the fragment `??T` we store the offset of every word in the three-letter `Word List` block that matches that pattern. That is immediately followed by a list of all the words that match `??U`, and so forth.
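Given two such sorted lists, the intersection used by MATCH mode amounts to a single merge pass. Here is a minimal sketch under that assumption; `intersect_fragments` is a hypothetical helper, not the actual implementation:

```c
#include <glib.h>

/* Hypothetical helper: intersect two sorted fragment lists (raw gushort
 * word indexes, as stored in the Letter List block) in one merge pass.
 * Matching indexes are appended to OUT, a GArray of gushorts. */
static void
intersect_fragments (const gushort *a, gsize a_len,
                     const gushort *b, gsize b_len,
                     GArray        *out)
{
  gsize i = 0, j = 0;

  while (i < a_len && j < b_len)
    {
      if (a[i] < b[j])
        i++;
      else if (a[i] > b[j])
        j++;
      else
        {
          g_array_append_val (out, a[i]);
          i++;
          j++;
        }
    }
}
```

Repeating this pairwise over every fragment of the filter (e.g. `??M????`, `???O???`, and `??????T` for `??MO??T`) produces the final list of matching words.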
The second part of the _Filter Fragment_ section is the _Filter Index_ for the _Letter List_ block. This is a geometrically increasing set of offsets that determines which section of the _Letter List_ contains a given fragment's word list. We store a guint for the byte offset of each list, and a gushort for the length of each list.

The _Filter Fragment_ section is super confusing to describe, so here's an example. Consider the query `"??T"`:

```plaintext
Start at Length = MIN-LENGTH

+-+          +-+
|A|x25 more  |A|x25 more
+-+          +-+
Length = 2 (Offset = 0)

                          / "??T" IS IN HERE
+-+          +-+          +-+
|A|x25 more  |A|x25 more  |A|x25 more
+-+          +-+          +-+
Length = 3 (Offset = 52 * 6)

+-+          +-+          +-+          +-+
|A|x25 more  |A|x25 more  |A|x25 more  |A|x25 more
+-+          +-+          +-+          +-+
Length = 4 (Offset = 130 * 6)

... Continues up to Length = MAX-LENGTH
```

In the example above, the query `"??T"` is found at offset 744. The next 6 bytes contain the guint offset within the _Letter List_ of the fragment's word list, and a `gushort` giving the length of that list.

#### Assumptions

We store the length of each filter fragment list as a gushort, so we assume that no list is longer than 65,535 words. So far, that's true for the lists we're using.

### Anagram Hash Table

The _Anagram Hash Table_ is used to store hashes for looking up anagrams. It is structured similarly to the _Filter Fragment_ section: first, there is the _Anagram Words_ block, which links hashes to words. Then there's the _Anagram Hash Index_, which is used to actually look up a hash and find its position in the _Anagram Words_ section. These are described below:

#### Anagram Words

This is a block of gushorts, each being the index of a `WordIndex`. Each index is included exactly once.

#### Anagram Hash Index

This section maps anagram hashes to word lists. It is structured as an array, with each record being 9 bytes long. A record contains three components:

* **OFFSET:** The offset from the base of the _Anagram Words_ block
* **HASH:** The hash of the sorted characters of each word
* **LEN:** The length of the hash's word list. For most words, which have no matching anagram, this is 1.

This list is sorted by hash value, and the expectation is that one can do a binary search through the array to find a hash.

```plaintext
+--------------+----------+------+---
|    guint     |  guint   |guchar|     (REPEATED)
+--------------+----------+------+---
     OFFSET        HASH     LEN
```

**NOTE:** There's room for optimizing this structure significantly if file size is a problem. In addition, the 9-byte record leaves us unaligned in the data structure.
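Because the records are unaligned, a reader has to copy the fields out rather than cast pointers into the block. A minimal sketch of the binary search, assuming the field order drawn above (`anagram_index_lookup` is a hypothetical name, and endianness handling is omitted):

```c
#include <glib.h>
#include <string.h>

/* Hypothetical lookup: binary-search the Anagram Hash Index (an array
 * of unaligned 9-byte records: guint offset, guint hash, guchar len)
 * for TARGET. Returns TRUE and fills OFFSET/LEN on a hit. */
static gboolean
anagram_index_lookup (const guchar *index_base, gsize n_records,
                      guint target, guint *offset, guchar *len)
{
  gsize lo = 0, hi = n_records;

  while (lo < hi)
    {
      gsize mid = lo + (hi - lo) / 2;
      const guchar *rec = index_base + mid * 9;
      guint hash;

      memcpy (&hash, rec + 4, sizeof hash); /* HASH field */
      if (hash < target)
        lo = mid + 1;
      else if (hash > target)
        hi = mid;
      else
        {
          memcpy (offset, rec, sizeof *offset); /* OFFSET field */
          *len = rec[8];                        /* LEN field */
          return TRUE;
        }
    }

  return FALSE;
}
```

The returned OFFSET and LEN then delimit the run of word indexes to read out of the _Anagram Words_ block; as noted above, each result still needs to be verified as a true anagram because of hash collisions.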
### Index Section

The _Index Section_ is a JSON block located at the end of the file. It's found by seeking to the end of the file and scanning backwards until a '\0' character is reached. Despite being at the end of the file, it's read first, as it has the keys needed to understand the data within the file.

Data we store in the header:

* Valid character list found in the puzzle. This can be used for building a filter, and is stored as an array of unichars
* Meta information about the data that's stored
* Locations of each word list – for example, the location of all 2-letter words, all 3-letter words, etc.
* Location of the filter fragment and anagram hash sections

#### Example Index

```json
\0{
  "charset": "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
  "filterchar": "?",
  "min-length": 2,
  "max-length": 21,
  "threshold": 50,
  "letter-list-offset": OFFSET,
  "letter-index-offset": OFFSET,
  "anagram-word-list-offset": OFFSET,
  "anagram-hash-index-offset": OFFSET,
  "anagram-hash-index-length": LENGTH,
  "words": [[2, STRIDE, OFFSET, LENGTH],
            [3, STRIDE, OFFSET, LENGTH],
            [4, STRIDE, OFFSET, LENGTH],
            ...
            [21, STRIDE, OFFSET, LENGTH]]
}
```

_Note the leading `\0` character, which indicates the start of the section._

* CHARSET is a Unicode string, UTF-8 encoded.
* STRIDE is the maximum width (in bytes) of each word in the table (see note below).
* OFFSET is an unsigned int with the number of bytes into the file at which the table starts.
* LENGTH is the number of entries in the table.

## Internal representation

It's really easy to get confused among all the internal tables. As a result, we use the following structs as our primary references when looking things up:

```c
typedef struct
{
  gint length;
  gint index;
} WordIndex;

typedef struct
{
  gint length;      /* Length of the word */
  gint position;    /* Position of the character within the word */
  gint char_offset; /* Location within the charset of the character */
} FilterFragment;
```

The WordIndex gives us a reference to a word (and its priority). We can look up a word with just these two fields. It is public and can be referenced externally. So we can look up a word by WordIndex, as well as find the WordIndex of a given word (if it exists).

The FilterFragment gives us an individual part of a filter, with only one character selected. So "???G????" is a filter fragment, and "???G??N?" is made out of two fragments combined. We can look up a list of WordIndexes for a given FilterFragment.

## Charsets and Internationalization

We have made every effort to make this UTF-8 clean / Unicode-aware. We normalize every word using [Normalization Form C](https://unicode.org/reports/tr15/). For English, this has the practical result that every character is uppercase.

The charset is important for calculating the filter table. We index characters based on their offset within the charset. For the default English charset, "A" == 0, "B" == 1, etc. For other languages, accents and other marks are considered independent characters. So the charset for Spanish would have both N and Ñ in it, as well as accented versions of all the letters.

The difference between word length and stride is not always clear. An example is the French word `PÂTÉ`. It has a LENGTH of 4, and a STRIDE of six. The `Â` with the circumflex is considered an independent character, and is not stored along with an unaccented `A`.
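To make the distinction concrete, a tiny check with GLib shows the two counts for that word (the actual STRIDE of a section also depends on its longest word and padding, so this only illustrates the code point vs. byte difference):

```c
#include <glib.h>
#include <string.h>

int
main (void)
{
  const gchar *word = "PÂTÉ"; /* NFC-normalized UTF-8 */

  /* LENGTH counts code points; the stored width counts UTF-8 bytes. */
  g_print ("code points: %ld\n", g_utf8_strlen (word, -1)); /* 4 */
  g_print ("bytes:       %ld\n", (long) strlen (word));     /* 6 */

  return 0;
}
```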
## A note on the words being used

I'm using [Peter Broda's word list](https://peterbroda.me/crosswords/wordlist/) for English crosswords. It's long and has plenty of dubious words, but it's a good starting point. We will pre-generate the mmapped file and update it before each release. This word list seems to be actively maintained and updated regularly.

This file is mostly clean. It's plain ASCII with \r\n line breaks. There are a few words with non-letters in them, and we just discard those (example: _MIA!;50_ has an exclamation point in it). There's no explicit license in the file, but permission is granted on the website:

> You are free to use this list in any way you'd like. This includes
> commercial uses, though I'd appreciate it if you didn't just turn
> around and try to sell it (but I mean, I'll still offer it for free
> to anyone so that wouldn't be a smart business venture anyway).

### Alternate word lists

The Wordnik word list is also available. It has no phrases and fewer words than the Peter Broda list, but that might be better for your puzzles. It can be used instead at compile time by enabling it [here](https://gitlab.gnome.org/jrb/crosswords/-/blob/master/src/meson.build?ref_type=heads#L206) and commenting out the line above it.