Fundamental Puzzle Type Survey

Status: Proposal | Partially Implemented | July 2024

Over the course of building Crosswords, we’ve identified a number of puzzle-specific data structures that are needed by the code base. Some of these have robust implementations, others are partially implemented, while some are reimplemented multiple places in the code base or just don’t exist.

This document is an attempt to list all the fudamental types we may need as well as list some of the performance tradeoffs associated with them. The end goal is to move a number of these types into libipuz so that it can be used universally.

Fundamental Structs

We have three fundamental types that are used everywhere: The coord, the clueid and the unichar.

typedef struct
{
  IpuzClueDirection direction;
  guint index;
} IpuzClueId;

typedef struct
{
  guint row;
  guint column;
} IpuzCellCoord;

typedef guint32 gunichar;

Of these IpuzCellCoord is the most embedded in the codebase.

All these structs give a semantic reference to the puzzle without requiring a pointer.

Introspection considerations

As an additional goal, we have a number of libipuz APIs that accept/return basic GLib data types (mostly GArray). Best practice for gobject-introspection is to move to custom types in order to make them bindable. Some considerations.

  • Structures should be be ref counted for complex data types, and copy/free for simple types. If a structure is mutable and ref counted, then it also needs a _dup() function to make a copy of it rather then a ref. This is needed for ipuz_puzzle_deep_copy()

  • Iters are challenging. g-i does not have support for an unmanaged gpointer (such as CharsetNode), and rust doesn’t make it easy. Wherever possible, we should use an integer-indexed access pattern.

    • That means iterable structs should support: _get_count() and _index() methods.

Text and Strings

First a note on text handling. All text internally is encoded as UTF-8. We validate it on input, and assume it’s valid everywhere in the codebase (similar to GTK). In multiple places, we effectively convert the string to UCS-4 by treating it as a stream of gunichars.

Because of the gridded nature of crosswords, things get a weird when we break text into discrete cells independent of the larger context. Breaking up a unicode string a character at a time can result in nonsensical results. It works well enough for English and other latin scripts, but long term we need to move to a stream of discrete graphemes. We’ve taken some initial steps that way with the misc cluster functions, but that is a really basic implementations and is just used for paste.

Ideally we’d find a robust grapheme library and involve that for indexing strings.

Note: this is all predicated on graphemes being needed for other languages, such as Indic scripts.

WordList

Note, the WordList currently does not support multi-codepoint graphemes, though it does support multi-byte UTF-8 strings. If we ever support a language with a word-list that requires that we will have to reassess this.

Custom Data Types

Character Set

IpuzCharset is used in a surprising number of places. It’s one of the most versatile structures we have. Some places it’s used:

  • Each puzzle has a charset listing its valid characters

  • This charset can be passed to PlayCell to filter out letters that aren’t part of the puzzle

  • We use it to keep a histogram of letters in use in a puzzle

    • We also keep a histogram of clue lengths by casting the length to a gunichar (!)

  • It’s used in the fast path for both the Grid and Acrostic solver. We add and remove characters to them every iteration of the loop.

  • We calculate the potential overlapping characters in a cell to limit options for users.

  • etc.

Challenges:

  • Currently, we have the mutable charset builder -> charset pattern. That has kept lookups fast for the Acrostic solver, but we’ve ran into places where we need to keep a mutable charset around. As an example, in word_list_find_intersections(), we want to set a builder’s value to be a multiple of another factor. There’s no way to go through a builder and adjust its values, so we have to create a second one.

  • We store a unicode codepoint per node. This is fine for all the languages we support, but may not be fine for eg. hindi. In the future, we may need to index these by a grapheme, instead of a unichar

Cell Array

We have a couple of instances of the CellArray and they have very different performance characteristics.

  • Selection keeps a sorted array and tests for uniqueness. Each cell should only exist once.

  • Another common use is every clue has a list of the cells that it refers to

  • The solver has a list of cells to search through, and removes/adds cells as it traverses the tree and backtracks. (This usecase may need a specialized structure though).

This data structure needs to be sorted/unique, or ordered depending on where it’s used.

Grid Words

Surprisingly, we have three different representations of words in the code base.

  • There are basic C strings. These come from user input or the puzzle file.

  • The WordList also has C strings encoded in it. These strings are special, as the bytes before it encode priority and enumerations

  • Finally, there’s the WordIndex struct. This struct can be used to lookup a word in a WordList

We commonly want to index clusters in the string. There is a lot of code that iterates through UTF-8 strings in the codebase that is incorrect (see string handling above) and thus we may want to consider a specialized word type. Consider the following (very artificial) grid entry:

spinal tap

This is encoded with the following UTF-8 string and cell boundaries:

S  P  ı             A  L  T  A  P
53 50 C4 B1 4E CC 88 41 4C 54 41 50 00
  |  |     |        |  |  |        |

the S and P characters are each one codepoint / byte. The dotless i (U+0131) is an additional codepoint and two bytes. The N-diaresis is contrived of two codepoints (U+004E and U+0308) and is three bytes. Finally, the last three characters — TAP — are all included in one cell as a rebus entry.

A very useful data structure would be one that breaks a word into cells, and lets you iterate through them.

It may also be useful to be able to indicate the offset of a cell array it exists in the grid in so we can index it. An example of a function that could return this is ipuz_crossword_get_clue_string_by_id().

In the above example, word[6] would return the string "TAP." That being said, outside the context of a grid the offsets are meaningless and potentially misleading, so this would need a bit of thought to make sure it is useful.

Word Array

We have lists of words showing up in multiple places. There’s WordList, WordArray, WordListModel,and WordArrayModel. In addition, multiple places just use a GArray of strings. Some consolidation would be nice at some point, especially if we ever get a good grid word type.

Filters

Filters are used by the word list to describe a partially filled in clue (eg. "GN??E"). Currently, filters are represented as strings. We have multiple independent implementations of filter validation code. There are also a couple common operations that we run frequently on filters, such as counting how many ‘?’ characters are in the filter — are multiple fastpaths triggered by totally open or closed filters. Having a simple filter class would be nice.

Guesses

IpuzGuesses is a simple overlay over a puzzle grid. It stores a string-per-cell, and is used to keep the guesses that a user makes in the game. It’s currently not a GObject, though there’s an outstanding MR to make it one.

Recently, the editor started using the guesses object to store pencil markings as a way to record scratch marks. It also is used to show results from the autofill functionality. Looking at other crossword editors, it may be useful to start running a background task that starts calculating things like intersection characters or mark trouble spots. If we start doing that, we’ll need to store more per cell than the current string. Given that, storing custom data per cell is a useful extension.

NOTE: We’d have to approach this somewhat carefully. Currently the PuzzleStack doesn’t keep guesses that are attached to puzzles and these would be cleared every grid/guess change. We’ll have to test assumptions, and see if keeping the Guesses around could save computational work.

Proposed Structures

Here’s the proposed data structures for libipuz (July, 2024)

  • IpuzCellCoord

    • mut, dup

  • IpuzCellCoordArray

    • mut, dup, ref

  • IpuzCellCoordRegion

    • mut, dup, ref

  • IpuzClueId

    • mut, dup

  • IpuzCharsetBuilder

    • mut, dup

  • IpuzCharset

    • ref

  • IpuzEnumeration

    • ref

We will hold off on a separate word structures for now.