v4j

Voynich for Java (v4j) library


Project maintained by mzattera. Hosted on GitHub Pages, theme by mattgraham.

Note 003 - Clustering

Last updated Mar. 20th, 2022.

This note refers to release v.3.0.0 of v4j; links to classes and files point to that release, and those files might have been changed, deleted, or moved in the current master branch. In addition, some of this note’s content might have become obsolete in more recent versions of the library.

Working notes do not provide detailed descriptions of the algorithms and classes used; for these, please refer to the library code and JavaDoc.

Please refer to the home page for a set of definitions that might be relevant for this working note.



Abstract

This note discusses the application of k-means clustering to Voynich pages, showing how the word types in a page strongly correlate with the page’s illustration type (Herbal, Biological, Pharmaceutical, etc.) and Currier’s language (see CURRIER (1976)).

Previous Works

I am not the first to apply this approach to the Voynich; just google “Voynich clustering” and you will find many articles and blog posts on the topic.

Probably one of the first works in this area was that by D’Imperio (see D’IMPERIO (1978a)).

I reserve the option to go over these publications in the future and compare them with the contents of this note.

Methodology

My starting point is the Voynich majority transliteration of the text (see v4j README). I use the EVA alphabet, but the choice of alphabet is not relevant for this discussion, as I look at whole words in the Voynich, not at their inner components.

Embedding and Distance Measure

The text is split into units for analysis, which can be single pages or bigger portions of text (e.g. parchments / bi-folios). Each unit is embedded as a bag of words: the dimensions are the “readable” word types in the Voynich (that is, Voynich “words” with no “unreadable” characters {1}), and the value for each dimension is the number of times the corresponding word type appears in the text unit.
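The embedding described above can be illustrated with a minimal Java sketch. This is not the v4j implementation: the class and method names are hypothetical, the three toy words are made-up input, and the use of '?' to mark an unreadable character is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch of a bag-of-words embedding; not part of v4j.
class BagOfWordsSketch {

    // Assumption: an unreadable character is marked with '?' in the transliteration.
    static boolean isReadable(String word) {
        return !word.isEmpty() && word.indexOf('?') < 0;
    }

    // One dimension per readable word type, in alphabetical order.
    static List<String> vocabulary(List<List<String>> units) {
        TreeSet<String> types = new TreeSet<>();
        for (List<String> unit : units) {
            for (String word : unit) {
                if (isReadable(word)) {
                    types.add(word);
                }
            }
        }
        return new ArrayList<>(types);
    }

    // Count how many times each vocabulary word type occurs in the unit.
    static double[] embed(List<String> unit, List<String> vocabulary) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < vocabulary.size(); i++) {
            index.put(vocabulary.get(i), i);
        }
        double[] vector = new double[vocabulary.size()];
        for (String word : unit) {
            Integer i = index.get(word);
            if (i != null) { // unreadable words never enter the vocabulary
                vector[i]++;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        // Toy text units; each inner list stands for the words of one page.
        List<List<String>> units = List.of(
                List.of("daiin", "chol", "daiin", "ch?l"),
                List.of("qokeedy", "chol"));
        List<String> vocab = vocabulary(units);
        System.out.println(vocab);
        for (List<String> unit : units) {
            System.out.println(Arrays.toString(embed(unit, vocab)));
        }
    }
}
```

In this sketch the word "ch?l" is ignored, so the first toy unit only contributes counts for "daiin" and "chol".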

Similarity between textual units is computed as the positive angular distance between the corresponding embeddings, that is, the angular distance between two vectors assumed to have only non-negative components.
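As a rough illustration, the angular distance for vectors with non-negative components is commonly defined as 2·arccos(cosine similarity)/π, which ranges from 0 (same direction) to 1 (orthogonal). The sketch below assumes this normalisation; the note does not state the exact formula used by v4j, and the class and method names are hypothetical.

```java
// Hypothetical sketch of an angular distance for non-negative vectors; not part of v4j.
class AngularDistanceSketch {

    // Cosine similarity, clamped to [-1, 1] to absorb floating-point rounding.
    static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("Zero vector has no direction");
        }
        double cos = dot / Math.sqrt(normA * normB);
        return Math.max(-1.0, Math.min(1.0, cos));
    }

    // Assumed normalisation: 0 for parallel vectors, 1 for orthogonal ones.
    static double positiveAngularDistance(double[] a, double[] b) {
        return 2.0 * Math.acos(cosineSimilarity(a, b)) / Math.PI;
    }

    public static void main(String[] args) {
        double[] pageA = {1, 2, 0}; // toy word counts
        double[] pageB = {1, 0, 1};
        System.out.println(positiveAngularDistance(pageA, pageB));
    }
}
```

Because the distance depends only on the angle between vectors, two text units with the same relative word frequencies but different lengths are treated as identical.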

Outliers

Before clustering, I look for “outliers”, that is, textual units which appear very dissimilar from the other textual units.

Based on this analysis {2}, I defined the following outliers, which are removed from the text before clustering.
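One simple way to quantify how dissimilar a unit is from the rest is its average distance to every other unit, which is what note {2} describes for the OutlierDetection class. The sketch below is only an illustration under assumptions: the class and method names are hypothetical, the distance is the assumed 2·arccos(cosine)/π form, and no cut-off is applied because the note does not state one.

```java
import java.util.Arrays;

// Hypothetical sketch of average-distance outlier scoring; not part of v4j.
class OutlierSketch {

    // Assumed angular distance for non-negative, non-zero vectors, in [0, 1].
    static double distance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        double cos = Math.max(-1.0, Math.min(1.0, dot / Math.sqrt(normA * normB)));
        return 2.0 * Math.acos(cos) / Math.PI;
    }

    // Average distance of each unit from all the other units (needs at least 2 units).
    static double[] averageDistances(double[][] units) {
        int n = units.length;
        double[] average = new double[n];
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++) {
                if (j != i) {
                    sum += distance(units[i], units[j]);
                }
            }
            average[i] = sum / (n - 1);
        }
        return average;
    }

    public static void main(String[] args) {
        // Toy embeddings: the third unit shares almost no words with the others.
        double[][] units = {{5, 3, 0}, {4, 3, 1}, {0, 0, 7}};
        System.out.println(Arrays.toString(averageDistances(units)));
    }
}
```

Units with the highest average distance are the candidates to inspect by hand before deciding whether to exclude them.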

Preliminary Exploration

The TensorBoard Embedding Projector was used for a quick, preliminary visual investigation of how Voynich pages cluster {3}.

The images below were obtained using the projector with the following parameters (a pre-populated version is available for your own exploration): T-SNE 2D projection, Label By=ID, Color By=Illustration + Language, Perplexity=5, Learning rate=0.01, Supervise=0, Iteration=10'000.

T-SNE visualization of Voynich pages

Currier’s Language

The image below shows how pages tend to form three distinct clusters, which are highly correlated with Currier’s languages (A or B).

T-SNE visualization of Voynich pages by language

Biological Pages

These pages cluster closely together.

T-SNE visualization of Voynich Biological pages

Stars Pages

The stars pages tend to cluster together, next to the Biological pages (they are all written in Currier’s B language).

T-SNE visualization of Voynich Stars pages

Herbal B Pages

The herbal pages written with Currier’s language B tend to cluster together, well separated from Herbal A pages.

T-SNE visualization of Voynich Herbal B pages

Zodiac Pages

The zodiac pages tend to cluster together, next to Herbal B pages.

T-SNE visualization of Voynich Zodiac pages

Herbal A Pages

The herbal pages written with Currier’s language A tend to cluster together, well separated from Herbal B pages.

T-SNE visualization of Voynich Herbal A pages

Pharmaceutical Pages

These pages tend to cluster together, next to but separated from Herbal A; note that all Pharmaceutical pages are written in Currier’s language A.

T-SNE visualization of Voynich Pharmaceutical pages

Astronomical Pages

These pages are grouped in two big parchments: f67 and f68.

T-SNE visualization of Voynich Astronomical pages

Cosmological Pages

These pages tend to be dispersed across the embedding space.

T-SNE visualization of Voynich Cosmological pages

K-Means Clustering

The table below summarizes the results of clustering the manuscript pages with K-Means {4}:

K-Means clustering of Voynich pages
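For readers unfamiliar with the algorithm, here is a minimal sketch of K-Means (Lloyd's iteration). It is not the code of KMeansClusterByWords: the class name, the squared Euclidean distance, the choice of random data points as initial centroids, and the seed parameter are all assumptions made for illustration. It also shows where the randomness mentioned in note {4} can come from.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of K-Means (Lloyd's algorithm); not part of v4j.
class KMeansSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return sum;
    }

    // Returns, for each point, the index of its cluster. Requires k <= points.length.
    static int[] cluster(double[][] points, int k, long seed, int maxIterations) {
        int n = points.length;
        int dim = points[0].length;

        // Random initialisation: k distinct data points become the first centroids.
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            order.add(i);
        }
        Collections.shuffle(order, new Random(seed));
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[order.get(c)].clone();
        }

        int[] assignment = new int[n];
        Arrays.fill(assignment, -1);
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            // Assignment step: each point joins its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(points[i], centroids[c])
                            < squaredDistance(points[i], centroids[best])) {
                        best = c;
                    }
                }
                if (best != assignment[i]) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged
            }
            // Update step: each centroid moves to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[i]][d] += points[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) { // an empty cluster keeps its old centroid
                    for (int d = 0; d < dim; d++) {
                        centroids[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}}; // toy data
        System.out.println(Arrays.toString(cluster(points, 2, 42L, 100)));
    }
}
```

With a fixed seed the sketch is reproducible; changing the seed changes the initial centroids and, on less clearly separated data, possibly the final clusters.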

We can see that:

In order to remove some noise, and noticing that in the vast majority of cases pages in a parchment share illustration type and language, I performed the clustering again, this time splitting the manuscript by parchment. Note that parchments 29, 31, 32, and 40 have been excluded as they contain Cosmological or Astronomical pages, which we already know do not cluster well.

The results are shown below (they are also available in the TensorFlow projector); the following parameters were used: T-SNE 2D projection, Label By=ID, Color By=Illustration + Language, Perplexity=5, Learning rate=1, Supervise=0, Iteration=1'000.

T-SNE visualization of Voynich parchments

We can see that there is a strong tendency for parchments to cluster based on their illustration type and language, with two notable exceptions:

K-Means clustering of the parchments confirms what I already found while clustering single pages (the table shows the page count for each cluster; some of the smallest clusters are omitted for clarity):

K-Means clustering of Voynich parchments
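A table of this kind can be built by counting, for every illustration type and language, how many pages fall into each cluster. The sketch below shows one way to do that; the class and method names are hypothetical, and the labels and cluster assignments are made-up toy values, not results from the manuscript.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch of a label-by-cluster count table; not part of v4j.
class ClusterTableSketch {

    // Rows are labels (illustration type + language), columns are cluster indices.
    static SortedMap<String, int[]> crossTab(String[] labels, int[] clusters, int k) {
        if (labels.length != clusters.length) {
            throw new IllegalArgumentException("One cluster index is needed per label");
        }
        SortedMap<String, int[]> table = new TreeMap<>();
        for (int i = 0; i < labels.length; i++) {
            table.computeIfAbsent(labels[i], label -> new int[k])[clusters[i]]++;
        }
        return table;
    }

    public static void main(String[] args) {
        // Toy input: one entry per page.
        String[] labels = {"Herbal A", "Herbal A", "Biological B", "Herbal A"};
        int[] clusters = {0, 0, 1, 1};
        for (Map.Entry<String, int[]> row : crossTab(labels, clusters, 2).entrySet()) {
            System.out.println(row.getKey() + "\t" + Arrays.toString(row.getValue()));
        }
    }
}
```

A row whose counts concentrate in a single column indicates that pages with that label were mostly assigned to the same cluster.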

I then took a closer look at languages A and B separately.

The image below shows the results of clustering Pharmaceutical and Herbal A parchments (the table shows the page count for each cluster; some of the smallest clusters are omitted for clarity):

K-Means clustering of Voynich parchments in language A

The image below shows the results of clustering Biological, Stars, and Herbal B parchments (the table shows the page count for each cluster; some of the smallest clusters are omitted for clarity):

K-Means clustering of Voynich parchments in language B

Conclusions

Based on the above clustering analysis we can conclude that:


Notes

{1} See v4j README.

{2} The class OutlierDetection is used to calculate the average distance of each page from the other pages in the text. The output of the class (PageEmbeddingDistance.xlsx) can be found in the analysis folder.

{3} The class BuildBoW can be used to generate data that can be uploaded to the TensorFlow projector for visualization. The output of this class, in the form of “vector” and “metadata” .TSV files, can be found in this folder, both for single pages and for entire parchments.

{4} Class KMeansClusterByWords performs K-Means clustering and prints out a report that can be easily converted into an Excel file. The class can be parameterized to run different types of experiments; its outputs, with some additional data, can be found as Excel files in the analysis folder. Keep in mind that the K-Means algorithm includes some randomness, so each experiment might produce a slightly different clustering.

{5} After publishing note 009, I decided to remove the zodiac pages (former “ZZ” cluster) from this list of clusters, since there is not much evidence that their cluster is better formed than the Cosmological or Astronomical ones. I also noticed that parchment 25 had been wrongly excluded from this table, which has now been re-generated using code in release v.12.0.0.

{6} CURRIER (1976) contains this comment: “The Newbold foliation indicates that the Biological Section extends through ff 85-86 and it would appear from the illustrations that the Pharmaceutical Section does not begin until f 87. However, frequency counts before and after the break at f 84/f 85 indicate a change from Biological material to something else.”



Copyright Massimiliano Zattera.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.