v4j

Voynich for Java (v4j) library



Note 008 - Simply the Best Grammar for Voynichese :-)

Last updated Feb. 22nd, 2022.

This note is part of the article I presented at the International Conference on the Voynich Manuscript 2022, ZATTERA (2022).

This note refers to release v.9.0.0 of v4j; links to classes and files refer to this release. Files might have been changed, deleted, or moved in the current master branch, and some of this note's content might have become obsolete in more recent versions of the library.

Working notes do not provide a detailed description of the algorithms and classes used; for this, please refer to the library code and JavaDoc.

Please refer to the home page for a set of definitions that might be relevant for this working note.



Abstract

Following the work described in Note 005 and Note 007, I created a grammar for words in the Voynich. This grammar covers almost two thirds of the tokens in the concordance version of the manuscript (62% of tokens, or 21.6% of word types). Moreover, it is the best grammar created so far according to the F1 score, a standard measure widely adopted to quantify the performance of classifiers.

In this note, I use the terms “model”, “graph”, “grammar”, “state machine”, and “classifier” more or less interchangeably to indicate a representation of the inner structure of Voynich words, as it is typically possible to move from one representation to an equivalent one (e.g. from a formal grammar to the equivalent state machine, to the graph depicting the machine; all of these can be viewed as classifiers, as explained below).

Methodology

I wrote some code {1} that, starting from the list of word types in the concordance version of the Voynich (see Note 001), creates a state machine to model them. The best version I found is described by the grammar below {2}.

The notation:

S:
	a,b -> X,Y

means that when the machine is in state S, it can emit either of the two strings a or b and then enter either of the two states X or Y. If we start from the initial state (<BEGIN>) and follow the evolution of the state machine through all possible paths ending in the final state (<END>), we generate all the strings this model defines as Voynich word types. Of course, this list does not match exactly the list of word types in the Voynich (that is, the Voynich vocabulary); the quality of this match is discussed in the sections below.

<BEGIN>:
	 -> 0_d, 0_q, 0_s, 1_o, 1_y, 2_l, 2_r,
	    3_tpkf, 4_C, 4_S, 5_TPK, 7_d, 8_a

0_d:
	d ->     4_C, 4_S
0_q:
	q -> 1_o
0_s:
	s ->     4_C

1_o:
	o -> 2_r, 3_tpkf, 4_C,     5_TPK, 6_eEB, 7_d, 8_a
1_y:
	y ->      3_tpkf, 4_C, 4_S

2_l:
	l -> 3_tpkf, 4_C, 4_S
2_r:
	r ->                  8_a

3_tpkf:
	t, p, k, f -> 4_C, 6_eEB, 8_a, 8_o, 11_y

4_C:
	ch -> 6_eEB, 8_a, 8_o, 10_d, 11_y
4_S:
	sh -> 6_eEB,      8_o

5_TPK:
	cth, cph, ckh -> 6_eEB, 8_a, 8_o

6_eEB:
	e, ee, eee -> 7_s, 8_o, 10_d, 11_y, <END>

7_d:
	d -> 8_a, 8_o
7_s:
	s ->          <END>

8_a:
	a -> 9_iJ,       10_l, 10_m, 10_n, 10_r
8_o:
	o ->       10_d, 10_l,             10_r, <END>

9_iJ:
	i, ii -> 10_n, 10_r

10_d:
	d -> 11_y, <END>
10_l:
	l -> 11_y, <END>
10_m:
	m ->       <END>
10_n:
	n ->       <END>
10_r:
	r ->       <END>
	
11_y:
	y -> <END>

<END>:
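
To make the notation more concrete, below is a minimal, self-contained Java sketch (hypothetical code, not part of v4j; the class name is invented for illustration) that encodes a small fragment of the grammar above as a transition map and checks whether a string can be generated from it. States referenced but not defined in the fragment simply reject.

import java.util.*;

// Hypothetical sketch (not v4j code): a fragment of the grammar above encoded as a
// map from state name to (emitted string -> list of next states).
public class SlotGrammarSketch {

	private static final Map<String, Map<String, List<String>>> MACHINE = new HashMap<>();

	private static void rule(String state, String[] emissions, String... nextStates) {
		Map<String, List<String>> byEmission = MACHINE.computeIfAbsent(state, k -> new HashMap<>());
		for (String e : emissions)
			byEmission.put(e, Arrays.asList(nextStates));
	}

	static {
		// Only a fragment of the full grammar, enough to accept e.g. "daiin" and "chedy".
		rule("<BEGIN>", new String[] { "" }, "7_d", "4_C");
		rule("4_C", new String[] { "ch" }, "6_eEB", "8_a", "8_o", "10_d", "11_y");
		rule("6_eEB", new String[] { "e", "ee", "eee" }, "7_s", "8_o", "10_d", "11_y", "<END>");
		rule("7_d", new String[] { "d" }, "8_a", "8_o");
		rule("8_a", new String[] { "a" }, "9_iJ", "10_l", "10_m", "10_n", "10_r");
		rule("9_iJ", new String[] { "i", "ii" }, "10_n", "10_r");
		rule("10_d", new String[] { "d" }, "11_y", "<END>");
		rule("10_n", new String[] { "n" }, "<END>");
		rule("11_y", new String[] { "y" }, "<END>");
	}

	/** Returns true if the given string can be generated starting from the given state. */
	private static boolean accepts(String state, String s) {
		if (state.equals("<END>"))
			return s.isEmpty();
		// States with no rules in this fragment have an empty map and reject everything.
		Map<String, List<String>> byEmission = MACHINE.getOrDefault(state, Collections.emptyMap());
		for (Map.Entry<String, List<String>> e : byEmission.entrySet()) {
			if (s.startsWith(e.getKey())) {
				String rest = s.substring(e.getKey().length());
				for (String next : e.getValue())
					if (accepts(next, rest))
						return true;
			}
		}
		return false;
	}

	public static void main(String[] args) {
		System.out.println(accepts("<BEGIN>", "daiin")); // true: d - a - ii - n
		System.out.println(accepts("<BEGIN>", "chedy")); // true: ch - e - d - y
		System.out.println(accepts("<BEGIN>", "xyz"));   // false
	}
}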

The state machine is also depicted below {3} (the <BEGIN> and <END> states are omitted for clarity). Nodes are named after the characters they represent (as explained in Note 007); weights on edges show how many tokens in the Voynich contain the corresponding sequence of characters.

State Machine

In an attempt to make the above picture and the state machine a bit easier to understand, I have created two diagrams where I split the characters between “initials”, which can appear at the beginning of words and are shown in green, and “finals”, which can appear at the end of a word and are shown in orange.

Initial characters in words
Final characters in words

Comparison with other works - Classifiers

As discussed in Note 006, several models for the inner structure of Voynich word types have been created. The purpose of these models is to describe such structure in a way that provides some insights (e.g. showing that a particular sequence of characters is often used). In this process, any model necessarily makes some choices and simplifications in order to highlight what looks like a common pattern, disregarding what seems more of a “special case”. How can we evaluate how good a model is? How well does it describe Voynich word types, without indulging too much in useless details and exceptions?

An answer to this question comes from the theory of classifiers. We can look at each model as a tool that, given a word in Voynichese, can tell whether the word is a true Voynich word or not; this is done by looking at the set of word types the model can generate and classifying a word type as a Voynich word type if it belongs to this set. In this context, there are four possible cases for each word type T we want to classify:

                                              Word type T is generated       Word type T is NOT generated
                                              by the model                   by the model
Word type T appears in the Voynich            True Positive (TP)             False Negative (FN)
Word type T does NOT appear in the Voynich    False Positive (FP)            True Negative (TN)

We can see a model is a good model if the numbers of true positives and true negatives are high, whilst the numbers of false positives and false negatives are low (ideally 0).

We can define the precision of a model as the percentage of positives that are true positives; that is, the percentage of word types generated by the model that do appear in the manuscript. From the above table we can easily see that:

Precision = True Positives / All word types generated = TP / (TP + FP)

A model with precision = 1 (or 100%) will generate only words that appear in the Voynich; notice this does not mean it generates all of the Voynich word types. For example, a model that generates only the word type ‘daiin’ has 100% precision, as all the word types it generates are true Voynich word types, but it ignores all of the other word types.

We can then define the recall of a model as the percentage of word types in the Voynich that are generated by the model. Again, from the above table we can easily see that:

Recall = True Positives / All word types in the Voynich = TP / (TP + FN)

A model with recall = 1 (or 100%) will generate all word types in the text, but it can generate many more that are not in the text. For example, a model that generates all possible combinations of letters in the Voynich alphabet will generate all of the word types in the manuscript (recall = 1), but it will also generate a huge number of words that are not in the Voynich (false positives).

You can see there is a tension between precision and recall; one can increase recall by having the model generate more words, but this increases the risk of generating word types which are not Voynich word types (false positives), thus reducing precision. There is a need for a single number that represents a compromise between best recall and best precision; in the theory of classifiers this number is called the F1 score and is defined as:

F1 = 2 * precision * recall / (precision + recall)

You can see the F1 score tends to 0 if either precision or recall tends to 0, while it tends to 1 if both recall and precision tend to 1.
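
To make the computation explicit, here is a minimal Java sketch (hypothetical helper code, not the WordModelEvaluator class used for the table below) that derives the three scores from raw counts. The precision check uses the “SLOT MACHINE” row of the table; the recall and F1 example uses made-up counts, since these scores also require the number of word types the model misses.

// Hypothetical sketch (not v4j code): precision, recall, and F1 from raw counts.
public class ClassifierScores {

	static double precision(long tp, long fp) {
		return (double) tp / (tp + fp); // TP / (TP + FP)
	}

	static double recall(long tp, long fn) {
		return (double) tp / (tp + fn); // TP / (TP + FN)
	}

	static double f1(double precision, double recall) {
		if (precision + recall == 0.0)
			return 0.0;
		return 2.0 * precision * recall / (precision + recall);
	}

	public static void main(String[] args) {
		// From the "SLOT MACHINE" row below: 1'113 true positives out of 3'110 generated strings.
		System.out.printf("precision = %.3f%n", precision(1_113, 3_110 - 1_113)); // 0.358

		// Recall and F1 also need the number of Voynich word types the model misses (FN);
		// the counts below are made up, purely to show the computation.
		double p = precision(900, 100);  // 0.900
		double r = recall(900, 2_100);   // 0.300
		System.out.printf("F1 = %.3f%n", f1(p, r)); // 0.450
	}
}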

The table below compares our grammar with the other models described in Note 006, providing their precision, recall, and F1 score {4}.

“Generated Strings” is the number of strings a model generates. “Positive Tokens” is the percentage of tokens in the Voynich that the model recognizes.

Model            Generated Strings            True Positives  Positive Tokens  Precision  Recall  F1 Score
ROE              120                          112             15.954%          0.933      0.022   0.043
STOLFI           143’124’560’075’240’080’000  4’527           97.813%          0.000      0.881   0.000
NEAL_1a          87’480                       535             20.083%          0.006      0.104   0.012
NEAL_1b          174’818                      1’782           66.013%          0.010      0.347   0.020
NEAL_2           1’311’345                    1’049           45.248%          0.001      0.204   0.002
PALMER                                        4’547           97.280%          0.000      0.884   0.000
VOGT (Recipes)   32’575                       424             58.697%          0.013      0.190   0.024
VOGT             32’575                       565             55.734%          0.017      0.110   0.030
PELLING                                       259             32.099%          0.000      0.050   0.000
PELLING_2        1’192                        259             32.099%          0.217      0.050   0.081
SLOT             4’643’467                    2’617           86.447%          0.001      0.509   0.001
SLOT MACHINE     3’110                        1’113           62.040%          0.358      0.216   0.270
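
For clarity on how the “Positive Tokens” column above could be computed, here is a hypothetical sketch (not v4j code; the method and variable names are invented) that weighs each recognized word type by its number of occurrences:

import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: percentage of tokens whose word type is generated by a model.
public class PositiveTokens {

	static double positiveTokens(Map<String, Integer> tokenCounts, Predicate<String> generatedByModel) {
		long total = 0, positive = 0;
		for (Map.Entry<String, Integer> e : tokenCounts.entrySet()) {
			total += e.getValue();
			if (generatedByModel.test(e.getKey()))
				positive += e.getValue();
		}
		return 100.0 * positive / total;
	}

	public static void main(String[] args) {
		// Made-up counts: "daiin" occurs 5 times, "xyz" twice; only "daiin" is generated.
		Map<String, Integer> counts = Map.of("daiin", 5, "xyz", 2);
		System.out.println(positiveTokens(counts, w -> w.equals("daiin"))); // ~71.43
	}
}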

Conclusions


Notes

{1} Class BuildSlotStateMachine was used for this purpose. It provides means to generate and evaluate state machines and output them in different formats, including a grammar description and Java code to be used with the v4j library.

{2} A parameter required by BuildSlotStateMachine is the minimum number of word types that must use an edge of the state machine to keep the edge alive (that is, the minimum number of word types containing a specific sequence of two characters). For the machine in this note, I used a minimum weight of 10. A machine with a higher F1 score (TP=1’194, Precision=0.371, Recall=0.232, F1=0.286) can be created using a minimum weight of 5; its grammar is described in StateMachine05_grammar.txt and a graphical representation suitable for Gephi can be found in StateMachine05.gephi in the same folder. In this discussion, I am ignoring it, as it is also slightly more complex than the one presented here.
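
To illustrate what the minimum-weight parameter does, the sketch below shows the idea only (it is not the actual BuildSlotStateMachine code, and the edge names and weights are made up): pruning amounts to dropping every edge supported by fewer word types than the threshold.

import java.util.HashMap;
import java.util.Map;

// Simplified sketch of minimum-weight pruning (not the actual BuildSlotStateMachine code).
public class EdgePruningSketch {

	public static void main(String[] args) {
		// Made-up edge weights: "fromState->toState" -> number of word types using that edge.
		Map<String, Integer> edgeWeights = new HashMap<>();
		edgeWeights.put("A->B", 312);
		edgeWeights.put("B->C", 45);
		edgeWeights.put("B->D", 7); // supported by fewer than 10 word types

		int minWeight = 10; // the value used for the machine in this note

		// Drop every edge used by fewer than minWeight word types.
		edgeWeights.values().removeIf(weight -> weight < minWeight);

		System.out.println(edgeWeights.keySet()); // [A->B, B->C] (order may vary)
	}
}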

{3} A version of this graph that can be visualized using Gephi (StateMachine10.gephi) is stored here.

{4} Class WordModelEvaluator was used to create this table.



Copyright Massimiliano Zattera.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.