# Fundamental Puzzle Type Survey
**Status:** Proposal | Partially Implemented | July 2024

Over the course of building Crosswords, we've identified a number of
puzzle-specific data structures that are needed by the code base. Some
of these have robust implementations, others are partially
implemented, while some are reimplemented multiple places in the code
base or just don't exist.

This document is an attempt to list all the fudamental types we may
need as well as list some of the performance tradeoffs associated with
them. The end goal is to move a number of these types into libipuz so
that it can be used universally.


# Fundamental Structs

We have three fundamental types that are used everywhere: The coord,
the clueid and the unichar. These types represent a clue, a cell, and
a character respectively. Of these `IpuzCellCoord` is the most
embedded in the codebase.

```C
typedef struct
{
  IpuzClueDirection direction;
  guint index;
} IpuzClueId;

typedef struct
{
  guint row;
  guint column;
} IpuzCellCoord;

typedef guint32 gunichar;
```

> **NOTE:** the first two items refer to a _location_ in a puzzle, and
> are not a pointer. They can be converted back and forth to an
> `IpuzClue *` and `IpuzCell *`. There's no guarantee that the
> location is valid in the puzzle.

All these structs give a semantic reference to the puzzle without
requiring a pointer.

# Introspection considerations

As an additional goal, we have a number of libipuz APIs that
accept/return basic GLib data types (mostly `GArray`). Best practice
for gobject-introspection is to move to custom types in order to make
them bindable. Some considerations.

* Structures should be be ref counted for complex data types, and
  copy/free for simple types. If a structure is mutable and ref
  counted, then it also needs a `_dup()` function to make a copy of
  it rather then a ref. This is needed for `ipuz_puzzle_deep_copy()`
* Iters are challenging. g-i does not have support for an unmanaged
  gpointer (such as `CharsetNode`), and rust doesn't make it
  easy. Wherever possible, we should use an integer-indexed access
  pattern.
    * That means iterable structs should support: `_get_count()` and
      `_index()` methods.

# Text and Strings

First a note on text handling. All text internally is encoded as
`UTF-8`. We validate it on input, and assume it's valid everywhere in
the codebase (similar to GTK). In multiple places, we effectively
convert the string to `UCS-4` by treating it as a stream of
`gunichars`.

Because of the gridded nature of crosswords, things get a weird when
we break text into discrete cells independent of the larger
context. Breaking up a unicode string a character at a time can result
in nonsensical results. It works well enough for English and other
latin scripts, but long term we need to move to a stream of discrete
graphemes. We've taken some initial steps that way with the [misc
cluster
functions](https://gitlab.gnome.org/jrb/crosswords/-/blob/master/src/crosswords-misc.h?ref_type=heads#L39),
but that is a really basic implementations and is just used for paste.

Ideally we'd find a robust grapheme library and involve that for
indexing strings.

> **NOTE:** this is all predicated on graphemes being needed for other
> languages, such as Indic scripts.

## The WordList and strings

Note, the WordList currently does **not** support multi-codepoint
graphemes, though it does theoretically support multi-byte UTF-8
strings. If we ever support a language with a word-list that requires
that we will have to reassess this.

# Custom Data Types

## Character Set

`IpuzCharset` is used in a surprising number of places. It's one of
the most versatile structures we have. Some places it's used:

* Each puzzle has a charset listing its valid characters
* This charset can be passed to PlayCell to filter out letters that
  aren't part of the puzzle
* We use it to keep a histogram of letters in use in a puzzle
   * We also keep a histogram of clue lengths by casting the length to
     a gunichar (!)
* It's used in the fast path for both the Grid and Acrostic solver. We
  add and remove characters to them every iteration of the loop.
* We calculate the potential overlapping characters in a cell to limit
  options for users.
* etc.


**Challenges:**
* Currently, we have the mutable charset builder -> charset
  pattern. That has kept lookups fast for the Acrostic solver, but
  we've ran into places where we need to keep a mutable charset
  around. As an example, in `word_list_find_intersections()`, we want
  to set a builder's value to be a multiple of another factor. There's
  no way to go through a builder and adjust its values, so we have to
  create a second one.
* We store a unicode codepoint per node. This is fine for all the
  languages we support, but may not be fine for eg. hindi. **In the
  future, we may need to index these by a grapheme, instead of a
  unichar.** More research is needed here.

## Cell Array

We have a couple of instances of the CellArray and they have very
different performance characteristics.

* Selection keeps a sorted array and tests for uniqueness. Each cell
  should only exist once.
* Another common use is every clue has a list of the cells that it
  refers to
* The solver has a list of cells to search through, and removes/adds
  cells as it traverses the tree and backtracks. (This usecase may
  need a specialized structure though).

This data structure needs to be sorted/unique, or ordered depending on
where it's used.

**UPDATE:** We've since written
[IpuzCellCoordArray](https://libipuz.org/libipuz-1.0/struct.CellCoordArray.html)
that handles the case of cells for clues. We'll need a separate
structure for selections.

## Grid Words

Surprisingly, we have three different representations of words in the
code base.

* There are basic C strings. These come from user input or the puzzle file.
* The `WordList` also has C strings encoded in it. These strings are
  special, as the bytes before it encode priority and enumerations
* Finally, there's the `WordIndex` struct. This struct can be used to
  lookup a word in a `WordList`

We commonly want to index clusters in the string. There is a lot of
code that iterates through UTF-8 strings in the codebase that is
incorrect (see string handling above) and thus we may want to consider
a specialized word type. Consider the following (very artificial) grid
entry:

![spinal tap](spinaltap.png)

This is encoded with the following UTF-8 string and cell boundaries:

```
S  P  ı     N̈        A  L  T  A  P
53 50 C4 B1 4E CC 88 41 4C 54 41 50 00
  |  |     |        |  |  |        |
```

the *S* and *P* characters are each one codepoint / byte. The *dotless
i* (U+0131) is an additional codepoint and two bytes. The *N-diaresis*
is contrived of two codepoints (U+004E and U+0308) and is three
bytes. Finally, the last three characters — *TAP* — are all included
in one cell as a rebus entry.

A very useful data structure would be one that breaks a word into
cells, and lets you iterate through them.

It may also be useful to be able to indicate the offset of a cell
array it exists in the grid in so we can index it. An example of a
function that could return this is
`ipuz_crossword_get_clue_string_by_id()`.

In the above example, `word[6]` would return the string `"TAP."` That
being said, outside the context of a grid the offsets are meaningless
and potentially misleading, so this would need a bit of thought to
make sure it is useful.

## Word Array

We have lists of words showing up in multiple places. There's
`WordList`, `WordArray`, `WordListModel`,and `WordArrayModel`. In
addition, multiple places just use a `GArray` of strings. Some
consolidation would be nice at some point, especially if we ever get a
good grid word type.

## Filters

Filters are used by the word list to describe a partially filled in
clue (eg. `"GN??E"`). Currently, filters are represented as
strings. We have multiple independent implementations of filter
validation code. There are also a couple common operations that we run
frequently on filters, such as counting how many '?' characters are in
the filter — are multiple fastpaths triggered by totally open or
closed filters. Having a simple filter class would be nice.

## Guesses

IpuzGuesses is a simple overlay over a puzzle grid. It stores a
string-per-cell, and is used to keep the guesses that a user makes in
the game. It's currently not a GObject, though there's an outstanding
MR to make it one.

Recently, the editor started using the guesses object to store pencil
markings as a way to record scratch marks. It also is used to show
results from the autofill functionality. Looking at other crossword
editors, it may be useful to start running a background task that
starts calculating things like intersection characters or mark trouble
spots. If we start doing that, we'll need to store more per cell than
the current string. Given that, storing custom data per cell is a
useful extension.

> **NOTE:** We'd have to approach this somewhat carefully. Currently
> the `PuzzleStack` doesn't keep guesses that are attached to puzzles
> and these would be cleared every grid/guess change. We'll have to
> test assumptions, and see if keeping the Guesses around could save
> computational work.

**UPDATE:** Recent work on libipuz has made the Guesses more clear. We
are just going to support a user-defined string per cell. Each puzzle
type defines the expected semantics of that string. If the
editor/puzzle wants to store more information, it's free to define
it's own schema. In the worst-case scenario where structured data is
required, we can just serialize a `GVariant`.

# Proposed Structures

Here's the proposed data structures for libipuz (Oct 2024)

* IpuzCellCoord
    * _mut_, _dup_
* IpuzCellCoordArray
    * _mut_, _dup_, _ref_
* IpuzCellCoordRegion — _unimplemented_
    * _mut_, _dup_, _ref_
* IpuzClueId
    * _mut_, _dup_
* IpuzCharsetBuilder
    * _mut_, _dup_
* IpuzCharset
    * _ref_
* IpuzEnumeration
    * _ref_

We will hold off on a separate word structures for now.