# Fundamental Puzzle Type Survey **Status:** Proposal | Partially Implemented | July 2024 Over the course of building Crosswords, we've identified a number of puzzle-specific data structures that are needed by the code base. Some of these have robust implementations, others are partially implemented, while some are reimplemented multiple places in the code base or just don't exist. This document is an attempt to list all the fudamental types we may need as well as list some of the performance tradeoffs associated with them. The end goal is to move a number of these types into libipuz so that it can be used universally. # Fundamental Structs We have three fundamental types that are used everywhere: The coord, the clueid and the unichar. These types represent a clue, a cell, and a character respectively. Of these `IpuzCellCoord` is the most embedded in the codebase. ```C typedef struct { IpuzClueDirection direction; guint index; } IpuzClueId; typedef struct { guint row; guint column; } IpuzCellCoord; typedef guint32 gunichar; ``` > **NOTE:** the first two items refer to a _location_ in a puzzle, and > are not a pointer. They can be converted back and forth to an > `IpuzClue *` and `IpuzCell *`. There's no guarantee that the > location is valid in the puzzle. All these structs give a semantic reference to the puzzle without requiring a pointer. # Introspection considerations As an additional goal, we have a number of libipuz APIs that accept/return basic GLib data types (mostly `GArray`). Best practice for gobject-introspection is to move to custom types in order to make them bindable. Some considerations. * Structures should be be ref counted for complex data types, and copy/free for simple types. If a structure is mutable and ref counted, then it also needs a `_dup()` function to make a copy of it rather then a ref. This is needed for `ipuz_puzzle_deep_copy()` * Iters are challenging. g-i does not have support for an unmanaged gpointer (such as `CharsetNode`), and rust doesn't make it easy. Wherever possible, we should use an integer-indexed access pattern. * That means iterable structs should support: `_get_count()` and `_index()` methods. # Text and Strings First a note on text handling. All text internally is encoded as `UTF-8`. We validate it on input, and assume it's valid everywhere in the codebase (similar to GTK). In multiple places, we effectively convert the string to `UCS-4` by treating it as a stream of `gunichars`. Because of the gridded nature of crosswords, things get a weird when we break text into discrete cells independent of the larger context. Breaking up a unicode string a character at a time can result in nonsensical results. It works well enough for English and other latin scripts, but long term we need to move to a stream of discrete graphemes. We've taken some initial steps that way with the [misc cluster functions](https://gitlab.gnome.org/jrb/crosswords/-/blob/master/src/crosswords-misc.h?ref_type=heads#L39), but that is a really basic implementations and is just used for paste. Ideally we'd find a robust grapheme library and involve that for indexing strings. > **NOTE:** this is all predicated on graphemes being needed for other > languages, such as Indic scripts. ## The WordList and strings Note, the WordList currently does **not** support multi-codepoint graphemes, though it does theoretically support multi-byte UTF-8 strings. If we ever support a language with a word-list that requires that we will have to reassess this. # Custom Data Types ## Character Set `IpuzCharset` is used in a surprising number of places. It's one of the most versatile structures we have. Some places it's used: * Each puzzle has a charset listing its valid characters * This charset can be passed to PlayCell to filter out letters that aren't part of the puzzle * We use it to keep a histogram of letters in use in a puzzle * We also keep a histogram of clue lengths by casting the length to a gunichar (!) * It's used in the fast path for both the Grid and Acrostic solver. We add and remove characters to them every iteration of the loop. * We calculate the potential overlapping characters in a cell to limit options for users. * etc. **Challenges:** * Currently, we have the mutable charset builder -> charset pattern. That has kept lookups fast for the Acrostic solver, but we've ran into places where we need to keep a mutable charset around. As an example, in `word_list_find_intersections()`, we want to set a builder's value to be a multiple of another factor. There's no way to go through a builder and adjust its values, so we have to create a second one. * We store a unicode codepoint per node. This is fine for all the languages we support, but may not be fine for eg. hindi. **In the future, we may need to index these by a grapheme, instead of a unichar.** More research is needed here. ## Cell Array We have a couple of instances of the CellArray and they have very different performance characteristics. * Selection keeps a sorted array and tests for uniqueness. Each cell should only exist once. * Another common use is every clue has a list of the cells that it refers to * The solver has a list of cells to search through, and removes/adds cells as it traverses the tree and backtracks. (This usecase may need a specialized structure though). This data structure needs to be sorted/unique, or ordered depending on where it's used. **UPDATE:** We've since written [IpuzCellCoordArray](https://libipuz.org/libipuz-1.0/struct.CellCoordArray.html) that handles the case of cells for clues. We'll need a separate structure for selections. ## Grid Words Surprisingly, we have three different representations of words in the code base. * There are basic C strings. These come from user input or the puzzle file. * The `WordList` also has C strings encoded in it. These strings are special, as the bytes before it encode priority and enumerations * Finally, there's the `WordIndex` struct. This struct can be used to lookup a word in a `WordList` We commonly want to index clusters in the string. There is a lot of code that iterates through UTF-8 strings in the codebase that is incorrect (see string handling above) and thus we may want to consider a specialized word type. Consider the following (very artificial) grid entry: ![spinal tap](spinaltap.png) This is encoded with the following UTF-8 string and cell boundaries: ``` S P ı N̈ A L T A P 53 50 C4 B1 4E CC 88 41 4C 54 41 50 00 | | | | | | | ``` the *S* and *P* characters are each one codepoint / byte. The *dotless i* (U+0131) is an additional codepoint and two bytes. The *N-diaresis* is contrived of two codepoints (U+004E and U+0308) and is three bytes. Finally, the last three characters — *TAP* — are all included in one cell as a rebus entry. A very useful data structure would be one that breaks a word into cells, and lets you iterate through them. It may also be useful to be able to indicate the offset of a cell array it exists in the grid in so we can index it. An example of a function that could return this is `ipuz_crossword_get_clue_string_by_id()`. In the above example, `word[6]` would return the string `"TAP."` That being said, outside the context of a grid the offsets are meaningless and potentially misleading, so this would need a bit of thought to make sure it is useful. ## Word Array We have lists of words showing up in multiple places. There's `WordList`, `WordArray`, `WordListModel`,and `WordArrayModel`. In addition, multiple places just use a `GArray` of strings. Some consolidation would be nice at some point, especially if we ever get a good grid word type. ## Filters Filters are used by the word list to describe a partially filled in clue (eg. `"GN??E"`). Currently, filters are represented as strings. We have multiple independent implementations of filter validation code. There are also a couple common operations that we run frequently on filters, such as counting how many '?' characters are in the filter — are multiple fastpaths triggered by totally open or closed filters. Having a simple filter class would be nice. ## Guesses IpuzGuesses is a simple overlay over a puzzle grid. It stores a string-per-cell, and is used to keep the guesses that a user makes in the game. It's currently not a GObject, though there's an outstanding MR to make it one. Recently, the editor started using the guesses object to store pencil markings as a way to record scratch marks. It also is used to show results from the autofill functionality. Looking at other crossword editors, it may be useful to start running a background task that starts calculating things like intersection characters or mark trouble spots. If we start doing that, we'll need to store more per cell than the current string. Given that, storing custom data per cell is a useful extension. > **NOTE:** We'd have to approach this somewhat carefully. Currently > the `PuzzleStack` doesn't keep guesses that are attached to puzzles > and these would be cleared every grid/guess change. We'll have to > test assumptions, and see if keeping the Guesses around could save > computational work. **UPDATE:** Recent work on libipuz has made the Guesses more clear. We are just going to support a user-defined string per cell. Each puzzle type defines the expected semantics of that string. If the editor/puzzle wants to store more information, it's free to define it's own schema. In the worst-case scenario where structured data is required, we can just serialize a `GVariant`. # Proposed Structures Here's the proposed data structures for libipuz (Oct 2024) * IpuzCellCoord * _mut_, _dup_ * IpuzCellCoordArray * _mut_, _dup_, _ref_ * IpuzCellCoordRegion — _unimplemented_ * _mut_, _dup_, _ref_ * IpuzClueId * _mut_, _dup_ * IpuzCharsetBuilder * _mut_, _dup_ * IpuzCharset * _ref_ * IpuzEnumeration * _ref_ We will hold off on a separate word structures for now.