# Word List Definitions

**Status:** Final | Implemented | November 2024

**Author:** Jonathan Blandford

Each word-list is kept in a GResource file that can be loaded and unloaded
according to the user's settings. We extract the definitions for each word in
the word-list from wiktionary and put them in the `.gresource` file alongside
the word-list index data. This is a complicated and time-intensive process,
taking tens of minutes per build. The first step alone requires downloading
the raw wiktionary data, which is ~20 GB.

As a result, **we keep the pre-computed binary GResource files and data in
git** rather than building them from first principles.

> **NOTE:** One result of this is that we tend to build into
> `$SOURCE_DIR` rather than `$BUILD_DIR`.

## Files

`def-extractor.py` will generate a number of files:

* `word-lists/{name}-filtered-wiktextract-data.jsonl`
* **`word-lists/{name}-gvariant-defs.data`**
* **`word-lists/{name}-gvariant-index.data`**
* **`word-lists/{name}-enums.json`**

> **NOTE:** Files in bold are stored in git and updated when the
> word-list or format changes.

### Resource file: `{name}-filtered-wiktextract-data.jsonl`

This is an intermediate file and is not kept in git. It contains all the
definitions of words from the word list, pulled from the main wiktionary
data. The remaining operations use this file as an optimization.

> **NOTE:** The `.jsonl` extension indicates that the file is a list of JSON
> blocks, each stored on its own line. The entire file isn't valid JSON, but
> each line is. This is the same format the raw data comes in.

### Resource file: `{name}-gvariant-defs.data`

This contains all the definitions concatenated together. Each definition is
stored as a `GVariant` with a type signature of `"(sa(ysa(maqas)))"`.
Breaking it down:

```
( s              WORD
  a(             ENTRY-LIST
     y           POS
     s           HEADER
     a(          SENSE-LIST
        maq      OPTIONAL-TAGS-LIST
        as       GLOSSES-LIST
     )
  )
)
```

### Resource file: `{name}-gvariant-index.data`

This contains a sorted list of entries, each holding the hash of the
`FILTER`, the offset of the definition `GVariant` data in the defs file, and
its length. Each entry is padded to 12 bytes, so the file can be
binary-searched to find a hash.

> **NOTE:** As with anagrams, we don't worry about hash collisions when
> storing the index. We store all filters with the same hash together
> in the same block. We then walk the whole block and look at the
> stored word for each definition to see if it's one that matches the
> filter we want.

### Resource file: `{name}-enums.json`

This is a JSON file containing lists of both tags and POS names. These are
stored as plain JSON arrays; the POS (`y`) and tag (`q`) values in the defs
file are indices into those arrays.

## Defs creation steps

Here are the steps needed to create a new word-list:

### Definition data

First, download the raw wiktionary data from the
[wiktextract site](https://kaikki.org/dictionary/rawdata.html). There's
information there on how to generate those files, but they warn it takes a
day-and-a-half for each run, so we start with the pregenerated ones.

```shell
% cd word-lists/
% curl -O https://kaikki.org/dictionary/raw-wiktextract-data.jsonl.gz
% gunzip raw-wiktextract-data.jsonl.gz
```

### Update existing word-lists

To update the git files to a new dictionary (or newer versions of the
existing word-lists), simply run:

```shell
meson compile -C _build/ build-wordlist-defs
```

That will run `scripts/build-wordlist-defs.sh`.

### Adding a new word-list

This isn't really supported at this time. In concept, though, you'd have to
add a `load_wordlist()` function to `def-extractor.py` similar to the
existing ones, extend the argument parser, and extend
`generate_filtered_list()`. The rest of the code should work fine.
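For illustration, here is a minimal sketch of what such a loader might look
like. The function name, signature, and return type are hypothetical; the
real loaders in `def-extractor.py` may differ, and this sketch simply assumes
a plain text file with one word per line.

```python
# Hypothetical sketch only: the real loaders in def-extractor.py may use a
# different signature and return type.  This assumes a plain text file with
# one word per line.
def load_wordlist_example(path):
    """Return the list of words for a new word-list stored at `path`."""
    words = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            word = line.strip()
            if word:
                words.append(word)
    return words
```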
Then, from `${MESON_SOURCE_ROOT}/tools/wiktionary-extractor/`, run:

```shell
% ./def-extractor.py $WORDLIST filtered-list
% ./def-extractor.py $WORDLIST enums
% ./def-extractor.py $WORDLIST gvariant-list
```

Here, `$WORDLIST` is set to the name you gave your word-list.

### Glossary

* **Filter:** The crossword-suitable set of clusters representing a word.
  For example, `MOUSE`. Could also be `CAT?`, though we won't match that to
  a definition. This concept is shared with the word-list.
* **Word:** A word with a fixed spelling. Multiple words can have the same
  filter. Examples include *"aard-vark"*/*"aardvark"*, which share a filter
  of `AARDVARK`. Another example is _"A.A."_ (the initialism) and _"Aa"_
  (the river in France); they both share a filter of `AA`. One word may have
  multiple entries.
* **Entry:** A set of semantic meanings of a word, rooted in a common
  part-of-speech. An entry can contain multiple *senses*.
* **Sense:** The essence of a word. One sense may need multiple glosses to
  describe it.
* **Gloss:** A human-parseable sentence describing a sense. It is often
  circular and may contain examples. This is a gloss.
* **POS:** Acronym for Part of Speech.
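To make the **Filter**/**Word** distinction concrete, here is a small sketch
of reducing a word to its filter, based only on the examples above
(uppercasing and dropping everything that isn't a letter). The function name
is hypothetical, and the real word-list code may apply different rules
(accents, multi-character clusters, and so on).

```python
import unicodedata


def word_to_filter(word):
    """Hypothetical sketch: reduce a word to a crossword-style filter.

    Based on the glossary examples above: "aard-vark" and "aardvark" both
    become AARDVARK, and "A.A." and "Aa" both become AA.  The real rules
    live in the word-list code and may differ.
    """
    # Decompose accented characters so their base letters survive the
    # ASCII-letter check below.
    decomposed = unicodedata.normalize("NFKD", word)
    # Keep only ASCII letters, then uppercase them.
    letters = [c for c in decomposed if c.isascii() and c.isalpha()]
    return "".join(letters).upper()


assert word_to_filter("aard-vark") == "AARDVARK"
assert word_to_filter("A.A.") == "AA"
```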