Word List Definitions
Status: Final | Implemented | November 2024
Author: Jonathan Blandford
Each word-list is kept in a GResource file that can be loaded and unloaded as per the user’s settings.
We extract the definitions for each word in the word-list from Wiktionary and put them in the .gresource file alongside the word-list index data. This is a complicated and time-intensive process, taking tens of minutes per build. It also requires first downloading the raw Wiktionary data, which is ~20 GB.
As a result, we keep the pre-computed binary GResource data in git rather than building it from first principles.
NOTE: One result of this is that we tend to build into $SOURCE_DIR rather than $BUILD_DIR.
Files
def-extractor.py will generate a number of files:
word-lists/{name}-filtered-wiktextract-data.jsonl
word-lists/{name}-gvariant-defs.data
word-lists/{name}-gvariant-index.data
word-lists/{name}-enums.json
Note: Files in bold are stored in git and updated when the word-list or format changes.
Resource file: {name}-filtered-wiktextract-data.jsonl
This is an intermediate file and is not kept in git. It contains all the definitions of words from the word list, pulled from the main Wiktionary data. The remaining operations read this file instead of the full raw data as an optimization.
NOTE: the .jsonl extension indicates that it’s a list of JSON blocks, each stored on its own line. The entire file isn’t valid JSON, but each line is. This is the same format that the raw data comes in.
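As an illustration of the line-by-line format, here is a minimal Python sketch of filtering a JSONL stream down to the words in a word list. It is not the code in def-extractor.py: the function name filter_jsonl and the sample records are made up for this example, and it assumes each record carries its spelling in a "word" field.

```python
import io
import json


def filter_jsonl(lines, wanted_words):
    """Yield parsed records whose "word" field is in wanted_words.

    Hypothetical helper; assumes every record has a "word" key.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)  # each line is a complete JSON document
        if record.get("word") in wanted_words:
            yield record


# Tiny in-memory stand-in for the raw data file (made-up sample records).
raw = io.StringIO(
    '{"word": "mouse", "pos": "noun"}\n'
    '{"word": "zyzzyva", "pos": "noun"}\n'
    '{"word": "mouse", "pos": "verb"}\n'
)

kept = list(filter_jsonl(raw, {"mouse", "cat"}))

# Writing the result back out keeps the one-record-per-line layout.
filtered_text = "".join(json.dumps(record) + "\n" for record in kept)
```

Because every line parses on its own, the file can be streamed without ever loading the whole ~20 GB of raw data into memory.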
Resource file: {name}-gvariant-defs.data
This contains all the definitions concatenated together. Each definition is stored as a GVariant with a type signature of "(sa(ysa(maqas)))". Breaking it down:
(s    a(          y    s       a(          maq                 as)))
WORD  ENTRY-LIST  POS  HEADER  SENSE-LIST  OPTIONAL-TAGS-LIST  GLOSSES-LIST
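To make the nesting concrete, here is a rough Python sketch that mirrors the signature with plain tuples and lists and checks that a value has the right shape. It does not use GLib and does not produce real GVariant bytes. The function name matches_definition_shape and the sample values (including the numeric POS and tag values) are invented for illustration.

```python
def matches_definition_shape(value):
    """Return True if value mirrors the shape (s a(y s a(maq as)))."""

    def is_str_list(v):
        return isinstance(v, list) and all(isinstance(x, str) for x in v)

    def is_sense(v):
        # (maq as): optional list of 16-bit tag values, then a list of glosses
        if not (isinstance(v, tuple) and len(v) == 2):
            return False
        tags, glosses = v
        tags_ok = tags is None or (
            isinstance(tags, list)
            and all(isinstance(t, int) and 0 <= t <= 0xFFFF for t in tags)
        )
        return tags_ok and is_str_list(glosses)

    def is_entry(v):
        # (y s a(...)): POS byte, header string, list of senses
        if not (isinstance(v, tuple) and len(v) == 3):
            return False
        pos, header, senses = v
        return (
            isinstance(pos, int) and 0 <= pos <= 0xFF
            and isinstance(header, str)
            and isinstance(senses, list)
            and all(is_sense(s) for s in senses)
        )

    # (s a(...)): the word, then its list of entries
    if not (isinstance(value, tuple) and len(value) == 2):
        return False
    word, entries = value
    return (
        isinstance(word, str)
        and isinstance(entries, list)
        and all(is_entry(e) for e in entries)
    )


# Made-up sample: one word, one entry, two senses (one without tags).
example = (
    "mouse",
    [
        (0, "mouse (plural mice)", [
            ([3], ["A small rodent."]),
            (None, ["A hand-held pointing device."]),
        ]),
    ],
)
```

Here None stands in for the "nothing" case of the maybe type (m), y is a single byte, and q is an unsigned 16-bit integer.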
Resource file: {name}-gvariant-index.data
This contains a sorted list of the hash of the FILTER, the offset in the defs file of the definition GVariant data, and its length. Each chunk is padded to 12 bytes, and can be binary searched to find the hash.
NOTE: Like anagrams, we don’t worry about hash collisions when storing the index. We store all filters with the same hash together in the same block. We then walk the whole block and look at the stored word for each definition to see if it’s one that matches the filter we want.
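The lookup described above can be sketched in Python as follows. This is only an illustration under stated assumptions: it guesses that each 12-byte chunk is three little-endian unsigned 32-bit integers (hash, offset, length), assumes one chunk per distinct hash, and uses zlib.crc32 as a stand-in because the real hash function is not specified here. The names RECORD, toy_hash, build_index and lookup are hypothetical.

```python
import struct
import zlib

# Assumed chunk layout: hash, offset, length as little-endian uint32 (12 bytes).
RECORD = struct.Struct("<III")


def toy_hash(filter_text):
    # Stand-in hash for the sketch; not the hash the real index uses.
    return zlib.crc32(filter_text.encode("utf-8"))


def build_index(items):
    """Pack (filter, offset, length) triples into index bytes sorted by hash."""
    records = sorted((toy_hash(f), off, length) for f, off, length in items)
    return b"".join(RECORD.pack(*r) for r in records)


def lookup(index, filter_text):
    """Binary search the fixed-size chunks; return (offset, length) or None."""
    target = toy_hash(filter_text)
    lo, hi = 0, len(index) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        h, off, length = RECORD.unpack_from(index, mid * RECORD.size)
        if h == target:
            return off, length
        if h < target:
            lo = mid + 1
        else:
            hi = mid
    return None


# Made-up offsets and lengths into an imaginary defs file.
index = build_index([("AARDVARK", 0, 78), ("MOUSE", 78, 120), ("AA", 198, 64)])
```

A caller would then read `length` bytes at `offset` from the defs file and compare the stored word of each definition in that block against the filter it wants, which is what makes hash collisions harmless.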
Resource file: {name}-enums.json
This is a json file containing a list of both tags and POS names. These are stored as
Defs creation steps
Here are the steps needed to create a new word-list:
Definition data
First, download the raw Wiktionary data from the wiktextract site. There’s information there on how to generate those files, but they warn that each run takes a day and a half, so we start with the pregenerated ones.
% cd word-lists/
% curl -O https://kaikki.org/dictionary/raw-wiktextract-data.jsonl.gz
% gunzip raw-wiktextract-data.jsonl.gz
Update existing word-lists
To update the git files to a new dictionary (or newer versions of the existing word-lists), simply run:
meson compile -C _build/ build-wordlist-defs
That will run scripts/build-wordlist-defs.sh
Adding a new word-list
This isn’t really supported at this time. In concept though, you’d have to add a load_wordlist() function to ‘def-extractor.py’ similar to existing ones. Extend the arg parser and also generate_filtered_list(). The rest of the code should work fine.
Then, from ${MESON_SOURCE_ROOT}/tools/wiktionary-extractor/ run:
% ./def-extractor.py $WORDLIST filtered-list
% ./def-extractor.py $WORDLIST enums
% ./def-extractor.py $WORDLIST gvariant-list
With the $WORDLIST set to the name you gave your wordlist.
Glossary:
Filter: the crossword-suitable set of clusters representing a word. For example, MOUSE. Could also be CAT?, though we won’t match that to a definition. This concept is shared with the word-list.
Word: a word with a fixed spelling. Multiple words can have the same filter. Examples include “aard-vark”/“aardvark”, which share a filter of AARDVARK. Another example is “A.A.” (the initialism) and “Aa” (the river in France). They both share a filter of AA. One word may have multiple entries.
Entry: A set of semantic meanings of the word, rooted in a common part-of-speech. An entry can contain multiple senses.
Sense: The essence of a word. One sense may need multiple glosses to describe it.
Gloss: A human-parseable sentence describing a sense. It is often circular and may contain examples. This is a gloss.
POS: Acronym for Part of Speech
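To illustrate how several words can share one filter, here is a deliberately simplified Python sketch. The function simple_filter is hypothetical: it just keeps letters and uppercases them, which reproduces the examples in this glossary but is not the project's actual filter logic.

```python
def simple_filter(word):
    """Approximate a filter: keep letters only, uppercased (illustrative only)."""
    return "".join(ch for ch in word if ch.isalpha()).upper()


# Group the glossary's example words by their (approximate) filter.
groups = {}
for word in ["aard-vark", "aardvark", "A.A.", "Aa", "mouse"]:
    groups.setdefault(simple_filter(word), []).append(word)
```

After this runs, groups maps AARDVARK to both spellings of aardvark and AA to both "A.A." and "Aa", so a lookup by filter has to check the stored word to tell them apart.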