# Thoughts on scoring words

**Status:** Draft | January 2025

What makes a word a good word for a crossword? What makes it
interesting? Some words might entertain a solver, such as
`TANTALIZE`. Others might disgust or frustrate them: consider `MOIST`
or `SSW` (the direction). Some words are overused and can cause eyes
to roll — `EGOT` and `OLEO` — while others can excite and provoke
wonder.

Good words are the backbone of any word puzzle. Combined into a grid,
they almost become a form of poetry: a combination of words that
engages and delights the solver. A grid can be predetermined in places,
as the letters and words require, but it can also have moments of
whimsy and surprise, or clever moments that provoke thought.

This document is an attempt to enumerate measurable dimensions for
words that could be interesting for word puzzles, and to propose a few
ways of using them to score word lists. There are no absolutes when it
comes to language: everyone's lived experience is different, and the
language people know, speak, and are familiar with varies from person
to person. A score assigned here may underrepresent a word's
value. Nevertheless, this attempts to provide some structure.

> Crossword setters generally try to write for a common audience and
> make their puzzles accessible. However, there are no absolutes when
> it comes to culture. Words that are familiar to some are obscure to
> others. There's a reason crosswords thrive in local newspapers. A
> common location provides at least some common grounding for a setter
> to target.
>
> For a wonderful musing and history of this from a gendered
> perspective, read [Anna Shechtman's _The Riddles of the
> Sphinx_](https://www.harpercollins.com/products/the-riddles-of-the-sphinx-anna-shechtman?variant=42834880167970).

## Puzzle Kinds

It's worth noting that how a puzzle kind uses words affects how its
setters approach them. Standard crosswords use a lot of filler words,
and may leave less flexibility in what to choose. Cryptics, on the
other hand, have relatively few words in their grids and much more
freedom to choose them carefully.

# Overall approach

We propose assigning scores to words to get better results when
creating grids. These scores would be surfaced in both the _Word List_
and _Autofill_ features. We start with the following assumptions:

* _**Variety is key**_ First and foremost, having a good set of
  different types of words keeps the solver entertained and engaged.
* _**Don't clump traits**_ It's bad form to have too
  much similarity in a section of a puzzle.
* _**Where possible use familiar words...**_ It's fine to send your
  solver to the dictionary for some words, but if they need a
  dictionary to make any progress you might be making it too hard.
* _**...but not too many**_ Stretching your solvers' vocabulary is a
  plus. In addition, occasionally you need to reach for an obscure
  word to make an otherwise strong section fill.
* _**Human editing is best**_ Perhaps in the future it will be possible
  to have AI create high-quality grids, but the best ones will still
  have a high degree of human intervention. This is meant as an
  assistive tool and shouldn't be used to override editorial control.

## Traits

We propose a few measurable **traits** for a word that can have a
numerical rating. These dimensions can be used to drive variety in a
grid, and give the autosolver something to work with beyond word
shape. These ratings are considered independently of the grid being
filled and can be precomputed beforehand. The traits proposed are:

1. Lexical interest
1. Frequency
1. Familiarity
1. Definition count
1. Sentiment

Each word can have a score for each trait. That would give the setter
the ability to assess the overall grid and make decisions. It could
also be used by the autosolver to pick better words.
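
One way to picture a per-word record is sketched below. This is a
minimal Python sketch rather than a committed design: `TraitScores`,
its field names, and `trait_spread` are hypothetical names that simply
mirror the trait list above, and the numbers in the usage lines are
made up for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TraitScores:
    """Hypothetical per-word record; one field per proposed trait."""
    lexical_interest: float = 0.0
    frequency: float = 0.0
    familiarity: float = 0.0
    definition_count: float = 0.0
    sentiment: float = 0.0


def trait_spread(entries: list[TraitScores]) -> dict[str, float]:
    """Return max - min of each trait across the words in a grid.

    A small spread on a trait hints that the fill lacks variety there.
    """
    if not entries:
        return {f.name: 0.0 for f in fields(TraitScores)}
    return {
        f.name: max(getattr(e, f.name) for e in entries)
        - min(getattr(e, f.name) for e in entries)
        for f in fields(TraitScores)
    }


# Made-up scores, for illustration only.
grid = [
    TraitScores(lexical_interest=3.0, frequency=2.5),
    TraitScores(lexical_interest=1.0, frequency=5.0),
]
spread = trait_spread(grid)
```

Keeping the traits as separate fields, rather than collapsing them into
one number up front, leaves the setter free to inspect each dimension
on its own.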

# Details

We propose a way of measuring each of the traits below. For each
trait, we discuss what it captures and touch a little on how to
calculate it. It will take quite some experimentation to build a
practical score.

## Lexical Interest: Bigrams and Trigrams

Unusual-looking words that catch the eye are often a plus in
crosswords, and a good way to differentiate. One way to make a word
unusual is to have an unexpected run of characters.

For example, I would argue that `KNAVE` is more interesting than
`THINK`. They both have an `N` and a `K` in them, but the `KN` bigram
is rarer than `NK`. Likewise, there are some trigrams that are fairly
rare — for example `OXC` in `OXCART`.

This is an easier score to calculate as we don't need additional
datasets. Go through the word and see if any pair or triple of letters
is unexpected. If any of them passes a rarity threshold, we add it to
the score.
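
The steps above can be sketched roughly as follows. This is only an
illustration under some assumptions: n-gram counts are taken from the
word list itself, "unexpected" is read as "makes up less than
`threshold` of all n-grams", and the function names, the sample list,
and the 0.05 threshold are all hypothetical.

```python
from collections import Counter


def ngram_counts(words: list[str], n: int) -> Counter:
    """Count every run of n letters across the whole word list."""
    counts: Counter = Counter()
    for word in words:
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def rare_ngrams(word: str, counts: Counter, n: int, threshold: float) -> list[str]:
    """Return the n-grams in `word` whose share of all n-grams is below `threshold`."""
    total = sum(counts.values())
    if total == 0:
        return []
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    return [g for g in grams if counts[g] / total < threshold]


# Tiny illustrative list; a real run would use the full word list.
NGRAM_SAMPLE = ["THINK", "THANK", "BANK", "SINK", "TANK", "KNAVE"]
bigrams = ngram_counts(NGRAM_SAMPLE, 2)

knave_hits = rare_ngrams("KNAVE", bigrams, 2, threshold=0.05)
think_hits = rare_ngrams("THINK", bigrams, 2, threshold=0.05)
# The score here is simply how many n-grams cleared the threshold.
knave_score, think_score = len(knave_hits), len(think_hits)
```

With a sample this small the numbers mean little, but the shape of the
calculation is the same for trigrams (`n=3`) and for a full word list.
The threshold would need tuning against real data.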

## Frequency

How often a word is used may be an interesting characteristic: a word
that's used a lot is more likely to be known to solvers. Fortunately,
we can use the Google Ngram dataset to calculate the frequency of each
word. There are also great lists of words used in existing puzzles
that we could use to constrain it to crosswords.
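
Raw counts span many orders of magnitude, so a log scale is likely
easier to work with than the counts themselves. The sketch below
assumes the counts have already been extracted into a plain dictionary
(it does not touch any real dataset or API) and uses the Zipf-style
convention of log10 occurrences per billion words. `zipf_score`, the
corpus size, and the counts are all made up for illustration.

```python
import math


def zipf_score(count: int, corpus_size: int) -> float:
    """Log-scaled frequency: log10 of occurrences per billion words.

    Words that never appear get 0.0 rather than negative infinity.
    """
    if count <= 0 or corpus_size <= 0:
        return 0.0
    return math.log10(count / corpus_size * 1_000_000_000)


# Made-up counts in a made-up one-billion-word corpus.
CORPUS_SIZE = 1_000_000_000
raw_counts = {"THE": 50_000_000, "KNAVE": 1_000, "OXCART": 0}
scores = {w: zipf_score(c, CORPUS_SIZE) for w, c in raw_counts.items()}
```

The same function could be pointed at counts drawn from existing
puzzles instead of a general corpus, which would give a
crossword-specific frequency.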

**Links:**
* https://books.google.com/ngrams/
* https://cryptics.georgeho.org/

## Familiarity

Familiarity is akin to frequency but is a different rating. Words can
be familiar to solvers and not be in common parlance. Familiarity is
harder to determine, though there are efforts out there to build a
table. We'll have to research this.

**Links:**
* https://arxiv.org/pdf/1806.03431

## Number of definitions and parts of speech

This trait is particularly crossword-centric. Some words (think `SET`
and `RUN`) have a lot of different meanings, and are useful for
cryptics. We could compose a score valuing words that have a higher
number of definitions or multiple parts of speech. We have the data to
determine this already.
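
As a rough sketch of such a score, assume each word's senses are
available as `(part_of_speech, definition)` pairs. That shape, the
function name `definition_score`, and the `pos_bonus` weight are
assumptions for illustration, not a description of the data we
actually have.

```python
def definition_score(senses: list[tuple[str, str]], pos_bonus: float = 1.0) -> float:
    """Number of senses, plus a bonus for each part of speech beyond the first."""
    parts_of_speech = {pos for pos, _ in senses}
    extra_pos = max(0, len(parts_of_speech) - 1)
    return len(senses) + pos_bonus * extra_pos


# Abbreviated, hand-written senses for illustration only.
SET_SENSES = [
    ("verb", "to put something in a place"),
    ("verb", "to adjust a device"),
    ("noun", "a group of things that belong together"),
]
OXCART_SENSES = [("noun", "a cart pulled by oxen")]

set_score = definition_score(SET_SENSES)
oxcart_score = definition_score(OXCART_SENSES)
```

Whether extra parts of speech deserve more weight than extra
definitions is an open question; `pos_bonus` is there so it can be
tuned.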

## Sentiment and beauty / Swearing and profanity

It's possible to determine the sentiment around a word. People have
done surveys to determine if it's positive or negative. Along the same
lines, there are profane words that people probably don't want to see
while solving a crossword over breakfast.

> **NOTE:** The Peter Broda list has its own scoring
> system. Empirically, it strongly values profanity and crassness in
> its list. We may want to split profanity out as its own trait,
> separate from sentiment.

**Links:**
* https://github.com/stdlib-js/datasets-liu-positive-opinion-words-en
* https://en.wikipedia.org/wiki/Phonaesthetics
* https://www.sciencedirect.com/science/article/abs/pii/0749596X86900215
* https://github.com/surge-ai/profanity

# Other possibilities

It's worth talking about a few things that are too situational to be a
good dimension, or are hard to measure/calculate.

## Word Shape

The _shape_ of a word is the set of graphemes that are combined to
create it. This is highly situational and can't be precomputed. For
example, consider a standard crossword with the word `WHEY` in
it. That word will work really well in the last row, as every letter
in it is a valid and relatively high-frequency last letter for the
down clues. If it shifts up a row, you start running into
problems. Words ending in `Y?` and `H?` are rarer, and it's not
nearly as good a word in that position. The same word would have very
different scores based on where it is.

The autofill algorithm tries to account for that by checking the
crossing words for their frequency. As a result, we can skip this
factor when precomputing scores.
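
To make the position dependence concrete, here is a small sketch that
counts, for each letter of a word, how many list words carry that
letter at a given distance from their end. It is an illustration only,
not how the autofill algorithm works: `crossing_support` and the
sample list are hypothetical, and the sketch ignores the lengths of
the crossing slots.

```python
def crossing_support(word: str, words: list[str], offset_from_end: int) -> list[int]:
    """Count, per letter of `word`, the list words with that letter at
    `offset_from_end` (1 = last letter, 2 = second to last)."""
    return [
        sum(
            1
            for w in words
            if len(w) >= offset_from_end and w[-offset_from_end] == letter
        )
        for letter in word
    ]


# Tiny illustrative list; a real check would use the full word list.
SHAPE_SAMPLE = ["SNOW", "BATH", "TREE", "GREY", "WHY", "SHOE"]

last_row = crossing_support("WHEY", SHAPE_SAMPLE, 1)
one_row_up = crossing_support("WHEY", SHAPE_SAMPLE, 2)
```

In this sample every letter of `WHEY` has at least one candidate
crossing when it sits in the last row, while one row up `W` and `Y`
have none.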

## Clue words

Crosswords are primarily about the clues, of course. Some words,
especially for cryptics, just lead to good clues: words with other
words inside them, or words with clever or common homophones or
anagrams. The fact that `SILENT` and `LISTEN` are anagrams is a good
(though overused) example of this. The aforementioned `WHEY` is a
homophone of `WAY` and `WEIGH`, which makes it also valuable in
cryptics.

It's also worth considering word fragments. For example, words with
other words embedded within them make for good cryptic answers too
(and are great for rebus puzzles).

I don't have a good concept of how to measure this trait yet.
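
As a possible starting point, two of the ingredients mentioned above
(anagrams and embedded words) can at least be detected mechanically,
even if turning them into a score is still open. The sketch below
groups anagrams by their sorted letters and scans a word for
substrings that are themselves in the list. The function names,
`min_len`, and the sample words are assumptions for illustration;
homophones would need pronunciation data and are not covered.

```python
from collections import defaultdict


def anagram_groups(words: list[str]) -> list[list[str]]:
    """Group words that share the same letters, keeping only real groups."""
    groups: dict[str, list[str]] = defaultdict(list)
    for word in words:
        groups["".join(sorted(word))].append(word)
    return [group for group in groups.values() if len(group) > 1]


def embedded_words(word: str, words: list[str], min_len: int = 3) -> list[str]:
    """List shorter words found inside `word`, in order of position.

    The same fragment can appear more than once if it occurs twice.
    """
    vocab = set(words)
    found = []
    for start in range(len(word)):
        for end in range(start + min_len, len(word) + 1):
            piece = word[start:end]
            if piece != word and piece in vocab:
                found.append(piece)
    return found


# Tiny illustrative list; a real run would use the full word list.
CLUE_SAMPLE = ["SILENT", "LISTEN", "WHEY", "OXCART", "OX", "CART", "CAR", "ART"]

groups = anagram_groups(CLUE_SAMPLE)
inside_oxcart = embedded_words("OXCART", CLUE_SAMPLE, min_len=2)
```

Counts like these could feed a future "clue potential" trait, though
how to weight them against each other would need experimentation.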