Fundamental Puzzle Type Survey
Status: Proposal | Partially Implemented | July 2024
Over the course of building Crosswords, we’ve identified a number of puzzle-specific data structures that are needed by the code base. Some of these have robust implementations, others are partially implemented, while some are reimplemented multiple places in the code base or just don’t exist.
This document is an attempt to list all the fudamental types we may need as well as list some of the performance tradeoffs associated with them. The end goal is to move a number of these types into libipuz so that it can be used universally.
Fundamental Structs
We have three fundamental types that are used everywhere: The coord, the clueid and the unichar.
typedef struct
{
IpuzClueDirection direction;
guint index;
} IpuzClueId;
typedef struct
{
guint row;
guint column;
} IpuzCellCoord;
typedef guint32 gunichar;
Of these IpuzCellCoord
is the most embedded in the codebase.
All these structs give a semantic reference to the puzzle without requiring a pointer.
Introspection considerations
As an additional goal, we have a number of libipuz APIs that
accept/return basic GLib data types (mostly GArray
). Best practice
for gobject-introspection is to move to custom types in order to make
them bindable. Some considerations.
Structures should be be ref counted for complex data types, and copy/free for simple types. If a structure is mutable and ref counted, then it also needs a
_dup()
function to make a copy of it rather then a ref. This is needed foripuz_puzzle_deep_copy()
Iters are challenging. g-i does not have support for an unmanaged gpointer (such as
CharsetNode
), and rust doesn’t make it easy. Wherever possible, we should use an integer-indexed access pattern.That means iterable structs should support:
_get_count()
and_index()
methods.
Text and Strings
First a note on text handling. All text internally is encoded as
UTF-8
. We validate it on input, and assume it’s valid everywhere in
the codebase (similar to GTK). In multiple places, we effectively
convert the string to UCS-4
by treating it as a stream of
gunichars
.
Because of the gridded nature of crosswords, things get a weird when we break text into discrete cells independent of the larger context. Breaking up a unicode string a character at a time can result in nonsensical results. It works well enough for English and other latin scripts, but long term we need to move to a stream of discrete graphemes. We’ve taken some initial steps that way with the misc cluster functions, but that is a really basic implementations and is just used for paste.
Ideally we’d find a robust grapheme library and involve that for indexing strings.
Note: this is all predicated on graphemes being needed for other languages, such as Indic scripts.
WordList
Note, the WordList currently does not support multi-codepoint graphemes, though it does support multi-byte UTF-8 strings. If we ever support a language with a word-list that requires that we will have to reassess this.
Custom Data Types
Character Set
IpuzCharset
is used in a surprising number of places. It’s one of
the most versatile structures we have. Some places it’s used:
Each puzzle has a charset listing its valid characters
This charset can be passed to PlayCell to filter out letters that aren’t part of the puzzle
We use it to keep a histogram of letters in use in a puzzle
We also keep a histogram of clue lengths by casting the length to a gunichar (!)
It’s used in the fast path for both the Grid and Acrostic solver. We add and remove characters to them every iteration of the loop.
We calculate the potential overlapping characters in a cell to limit options for users.
etc.
Challenges:
Currently, we have the mutable charset builder -> charset pattern. That has kept lookups fast for the Acrostic solver, but we’ve ran into places where we need to keep a mutable charset around. As an example, in
word_list_find_intersections()
, we want to set a builder’s value to be a multiple of another factor. There’s no way to go through a builder and adjust its values, so we have to create a second one.We store a unicode codepoint per node. This is fine for all the languages we support, but may not be fine for eg. hindi. In the future, we may need to index these by a grapheme, instead of a unichar
Cell Array
We have a couple of instances of the CellArray and they have very different performance characteristics.
Selection keeps a sorted array and tests for uniqueness. Each cell should only exist once.
Another common use is every clue has a list of the cells that it refers to
The solver has a list of cells to search through, and removes/adds cells as it traverses the tree and backtracks. (This usecase may need a specialized structure though).
This data structure needs to be sorted/unique, or ordered depending on where it’s used.
Grid Words
Surprisingly, we have three different representations of words in the code base.
There are basic C strings. These come from user input or the puzzle file.
The
WordList
also has C strings encoded in it. These strings are special, as the bytes before it encode priority and enumerationsFinally, there’s the
WordIndex
struct. This struct can be used to lookup a word in aWordList
We commonly want to index clusters in the string. There is a lot of code that iterates through UTF-8 strings in the codebase that is incorrect (see string handling above) and thus we may want to consider a specialized word type. Consider the following (very artificial) grid entry:
This is encoded with the following UTF-8 string and cell boundaries:
S P ı N̈ A L T A P
53 50 C4 B1 4E CC 88 41 4C 54 41 50 00
| | | | | | |
the S and P characters are each one codepoint / byte. The dotless i (U+0131) is an additional codepoint and two bytes. The N-diaresis is contrived of two codepoints (U+004E and U+0308) and is three bytes. Finally, the last three characters — TAP — are all included in one cell as a rebus entry.
A very useful data structure would be one that breaks a word into cells, and lets you iterate through them.
It may also be useful to be able to indicate the offset of a cell
array it exists in the grid in so we can index it. An example of a
function that could return this is
ipuz_crossword_get_clue_string_by_id()
.
In the above example, word[6]
would return the string "TAP."
That
being said, outside the context of a grid the offsets are meaningless
and potentially misleading, so this would need a bit of thought to
make sure it is useful.
Word Array
We have lists of words showing up in multiple places. There’s
WordList
, WordArray
, WordListModel
,and WordArrayModel
. In
addition, multiple places just use a GArray
of strings. Some
consolidation would be nice at some point, especially if we ever get a
good grid word type.
Filters
Filters are used by the word list to describe a partially filled in
clue (eg. "GN??E"
). Currently, filters are represented as
strings. We have multiple independent implementations of filter
validation code. There are also a couple common operations that we run
frequently on filters, such as counting how many ‘?’ characters are in
the filter — are multiple fastpaths triggered by totally open or
closed filters. Having a simple filter class would be nice.
Guesses
IpuzGuesses is a simple overlay over a puzzle grid. It stores a string-per-cell, and is used to keep the guesses that a user makes in the game. It’s currently not a GObject, though there’s an outstanding MR to make it one.
Recently, the editor started using the guesses object to store pencil markings as a way to record scratch marks. It also is used to show results from the autofill functionality. Looking at other crossword editors, it may be useful to start running a background task that starts calculating things like intersection characters or mark trouble spots. If we start doing that, we’ll need to store more per cell than the current string. Given that, storing custom data per cell is a useful extension.
NOTE: We’d have to approach this somewhat carefully. Currently the
PuzzleStack
doesn’t keep guesses that are attached to puzzles and
these would be cleared every grid/guess change. We’ll have to test
assumptions, and see if keeping the Guesses around could save
computational work.
Proposed Structures
Here’s the proposed data structures for libipuz (July, 2024)
IpuzCellCoord
mut, dup
IpuzCellCoordArray
mut, dup, ref
IpuzCellCoordRegion
mut, dup, ref
IpuzClueId
mut, dup
IpuzCharsetBuilder
mut, dup
IpuzCharset
ref
IpuzEnumeration
ref
We will hold off on a separate word structures for now.