The Voynich Manuscript deciphered as a phonetic transcription of spoken Elu-Sinhala. Includes decoder, 4,591-entry vocabulary, and statistical validation.

The Voynich Manuscript Deciphered: A Phonetic Transcription of Spoken Elu-Sinhala

Kameldip Singh Basra | February 2026

Summary

The Voynich Manuscript (Beinecke MS 408, carbon-dated 1404-1438 CE) is identified as a 15th-century Elu-Sinhala pharmaceutical text -- a teaching manual recording a physician's spoken instructions for Ayurvedic preparations. The writing system is a bespoke abugida mapping 27 EVA characters to 14 Sinhala phonemes.

Key results:

  • 90.7% of 35,916 tokens glossable in English (32,572 tokens)
  • 99.3% matching Sinhala dictionary (Tier 1+2+3)
  • Only 0.7% truly unknown (236 tokens)
  • Keyword-section clustering: decoded semantic profiles differ by manuscript section (Z=31.81, 0/1000 shuffles), converging with Montemurro & Zanette (2013) from independent direction
  • Text-image convergence: decoded plant labels independently match Petersen's botanical illustration IDs
  • Directionality flip: EVA is RTL-optimized; decoded text is LTR-optimized — consistent with abugida encoding (Parisel 2025)
  • SOV syntax consistent across all manuscript sections (Z=7.04 postpositional, Z=4.19 noun-before-verb; Greenberg 1963 Universal 4)
  • Pharmaceutical collocations: 16/36 Ayurvedic pairs confirmed, 0/10 random decoders produce any
  • External vocabulary validation: Z=3.5 against 150 independently-sourced pharmaceutical terms (non-circular)
  • Fair cross-language test: Sinhala matches 13x more pharmaceutical terms than any other language (equalized vocabulary, 115 concepts, 6 languages)
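
The keyword-section clustering result above is a label-permutation test: compute a chi-square statistic over the section-by-keyword contingency table, then shuffle the section labels and count how often chance reaches the observed value. A minimal self-contained sketch (function names and toy data are ours, not the repository's keyword_section_clustering.py):

```python
import random

def chi2_stat(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n
            stat += (obs - exp) ** 2 / exp
    return stat

def permutation_test(section_labels, keyword_labels, n_shuffles=1000, seed=0):
    """Shuffle section labels; count shuffles whose chi2 >= the observed chi2."""
    rng = random.Random(seed)
    sections = sorted(set(section_labels))
    keywords = sorted(set(keyword_labels))

    def tabulate(labels):
        table = [[0] * len(keywords) for _ in sections]
        for s, k in zip(labels, keyword_labels):
            table[sections.index(s)][keywords.index(k)] += 1
        return table

    observed = chi2_stat(tabulate(section_labels))
    shuffled = list(section_labels)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if chi2_stat(tabulate(shuffled)) >= observed:
            hits += 1
    return observed, hits
```

With strongly clustered toy data (e.g. plant keywords occurring only in the herbal section), `hits` stays at or near zero, which is the "0/1000 shuffles" reading of the result above.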

Validation Philosophy

This work follows an adversarial self-testing methodology. Every structural claim is tested against random decoders using the same decoder architecture with randomized consonant mappings. Tests that failed are documented alongside tests that passed. All scripts are reproducible from the public data.
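
The random-decoder control described here can be sketched as follows. The character inventories below are illustrative placeholders, not the actual H12 tables; the point is that each control decoder keeps the architecture fixed and permutes only the consonant mapping:

```python
import random

# Illustrative inventories only -- NOT the real H12 character tables.
EVA_CONSONANTS = ["d", "k", "t", "sh", "ch", "s", "l", "r", "m", "n"]
SINHALA_CONSONANTS = ["g", "k", "t", "m", "c", "s", "l", "r", "m", "n"]

def random_consonant_mapping(seed):
    """One random decoder: the same targets, assigned in a random permutation."""
    rng = random.Random(seed)
    targets = SINHALA_CONSONANTS[:]
    rng.shuffle(targets)
    return dict(zip(EVA_CONSONANTS, targets))

# A family of 200 control decoders, as in the validation scripts:
random_decoders = [random_consonant_mapping(seed) for seed in range(200)]
```

Every structural test is then run once with the real H12 mapping and once with each of these controls, so any advantage H12 shows cannot come from the decoder architecture itself.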

Honest disclosures:

  • Panchavidha circularity: The Panchavidha Z=10.5 is partially circular -- H12-decoded terms tested against H12 output. 3/6 terms are not in the Sinhala dictionary. 0/28 standard Sinhala preparation terms match H12 output. The non-circular version is the external pharma test (Z=3.5, 150 independently-sourced terms).
  • All raw dictionary Z-scores are negative (H12 matches fewer dictionary words than random decoders). The structural tests (SOV, Collocations, External Pharma) provide the stronger evidence because they test whether matched words form coherent medical text, not just raw counts.
  • The cross-language pharmaceutical Z-score is 1.66, below the conventional 2.0 threshold for statistical significance, though the raw advantage is 13x.
  • Two tests failed (folio clustering, recipe sequencing) and are documented below.

We actively invite additional tests. If there is a test you believe would strengthen or weaken this hypothesis, please open an issue or submit a PR. See Contributing below.

Repository Structure

├── main.tex                              # Full paper (LaTeX)
├── main.pdf                              # Compiled paper (PDF)
├── paper.md                              # Full paper (Markdown)
├── references.bib                        # Bibliography
├── VALIDATION_LOG.md                     # Chronological adversarial test log
├── data/
│   ├── voynich_eva_transcription.txt     # EVA corpus (Takahashi transcription)
│   ├── decoded_vocabulary.tsv            # 8,493-entry decoded vocabulary
│   ├── sinhala_dictionary.txt            # 1.47M romanized Sinhala dictionary
│   ├── external_pharmaceutical_vocab.tsv # 150 independently-sourced pharma terms
│   ├── crosslang_pharmaceutical_vocab.tsv # 60 concepts x 5 languages
│   ├── equalized_pharmaceutical_vocab.tsv # 115 concepts x 6 languages (fair test)
│   ├── hebrew_wordlist.txt               # Hebrew wordlist
│   ├── tamil_wordlist.txt                # Tamil wordlist
│   ├── hindi_wordlist.txt                # Hindi wordlist
│   ├── turkish_wordlist.txt              # Turkish wordlist
│   ├── latin_wordlist.txt                # Latin wordlist
│   └── arabic_wordlist.txt               # Arabic wordlist
├── scripts/
│   ├── h12_decoder.py                    # Core H12 decoder
│   ├── validate_coverage.py              # 4-tier coverage validation
│   ├── validate_vowel_final.py           # Vowel-final constraint test
│   ├── validate_domain_clustering.py     # Domain clustering analysis
│   ├── validate_phonotactics.py          # Cross-language phonotactic test
│   ├── validate_external_pharma.py       # External vocabulary + Panchavidha random comparison
│   ├── crosslang_discrimination_test.py  # Cross-language discrimination (5 Indic languages)
│   ├── structural_multilang_test.py      # Fair cross-language test (6 languages, equalized vocab)
│   ├── candidate_language_test.py        # Full candidate language comparison (7 languages)
│   ├── decoder_specificity_test.py       # Decoder specificity analysis
│   ├── loanword_isolation_test.py        # Sinhala vs Pali loanword isolation
│   ├── folio_pharma_clustering.py        # Folio-section clustering (FAILED — documented)
│   ├── recipe_sequence_test.py           # Recipe phase ordering (FAILED — documented)
│   ├── keyword_section_clustering.py     # Keyword-section clustering (Montemurro replication)
│   ├── entropy_analysis.py               # Entropy h2 and directionality analysis
│   ├── translate_manuscript.py           # Full manuscript translation
│   └── herbal_plant_guide.py             # Herbal folio plant identification
├── results/
│   ├── structural_evidence_summary.txt   # Master audit of all structural evidence
│   ├── structural_multilang.txt          # Fair cross-language test results
│   ├── panchavidha_validation.txt        # Panchavidha random comparison (Z=10.5)
│   ├── external_pharma_validation.txt    # External pharma validation (Z=3.5 controlled)
│   ├── crosslang_discrimination.txt      # Cross-language discrimination results
│   ├── candidate_language_validation.txt # Full candidate language comparison
│   ├── decoder_specificity.txt           # Decoder specificity analysis
│   ├── loanword_isolation.txt            # Loanword isolation results
│   ├── keyword_section_clustering.txt    # Keyword-section clustering results
│   └── entropy_directionality_analysis.txt # Entropy and directionality analysis
└── translation/
    ├── voynich_translation.md            # Full English translation
    └── herbal_plant_guide.md             # Plant identification guide

Requirements

  • Python 3.7+
  • NumPy (pip install numpy) — only required for validate_phonotactics.py

All other scripts use only the Python standard library.

Quick Start

# Decode a single word
python scripts/h12_decoder.py --word daiin
# Output: gena (Sinhala: "having taken")

# Decode the full corpus
python scripts/h12_decoder.py --input data/voynich_eva_transcription.txt --summary

# Run core validation scripts
python scripts/validate_coverage.py              # Tier 1: 90.7%, Total known: 99.3%
python scripts/validate_vowel_final.py           # 99.95% vowel-final
python scripts/validate_domain_clustering.py     # 46% medical (9x over random baseline)
python scripts/validate_phonotactics.py          # Sinhala ranks #2-3 across measures

# Run adversarial random-decoder comparisons
python scripts/validate_external_pharma.py       # Z=3.5 controlled, Panchavidha Z=10.5
python scripts/structural_multilang_test.py      # Fair cross-language: Sinhala 13x (Z=1.66)
python scripts/candidate_language_test.py        # Sinhala #1 by Z-score and raw coverage
python scripts/decoder_specificity_test.py       # H12 specificity analysis
python scripts/loanword_isolation_test.py        # Sinhala vs Pali discrimination

# Run literature-derived analyses
python scripts/keyword_section_clustering.py     # Keyword-section clustering (Z=31.81)
python scripts/entropy_analysis.py               # Entropy h2 and directionality

The H12 Decoder

The H12 decoder maps EVA (Extensible Voynich Alphabet) characters to Sinhala phonemes via abugida rules. The system uses a 14-phoneme inventory that corresponds to the spoken Elu dialect of medieval Sinhala.

Key mappings include:

  • sh maps to m
  • o maps to u
  • d (onset) maps to g
  • k (medial) maps to g
  • ch + consonant triggers devoicing
  • ct maps to th (aspirated dental)
  • ck maps to kh (aspirated velar)
  • cp maps to ph (aspirated labial)
  • q / h are silent (null carriers)

The decoder applies positional rules (onset vs. medial vs. coda) and abugida vowel-handling conventions to produce phonetic Sinhala output from EVA input.
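
A minimal sketch of these positional rules, simplified for illustration: it covers only the mappings listed above and omits abugida vowel handling and the ch-devoicing rule, so it produces a consonant-skeleton approximation rather than full h12_decoder.py output.

```python
# Digraphs from the mapping list above: aspirated stops, plus sh -> m.
DIGRAPHS = {"ct": "th", "ck": "kh", "cp": "ph", "sh": "m"}

def decode_token(eva):
    """Simplified positional decoding of one EVA token (sketch only)."""
    out = []
    i = 0
    while i < len(eva):
        pair = eva[i:i + 2]
        ch = eva[i]
        if pair in DIGRAPHS:                      # ct/ck/cp aspirates, sh -> m
            out.append(DIGRAPHS[pair])
            i += 2
        elif ch in "qh":                          # q / h are silent carriers
            i += 1
        elif ch == "o":                           # o -> u
            out.append("u")
            i += 1
        elif ch == "d" and i == 0:                # d in onset position -> g
            out.append("g")
            i += 1
        elif ch == "k" and 0 < i < len(eva) - 1:  # medial k -> g
            out.append("g")
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For example, this sketch turns `sho` into `mu` and `cto` into `thu`; the real decoder additionally inserts inherent vowels per the abugida conventions.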

The Key Insight: Spoken, Not Written

The manuscript encodes spoken language -- a phonetic transcription of medieval speech -- not written language in cipher. This explains why:

  • The 14-phoneme inventory matches pre-12th-century spoken Elu (no /b/, /v/, /f/)
  • N-gram frequencies match spoken Sinhala (#1 when spoken-weighted)
  • Edit-distance-1 matches are pronunciation variants, not errors
  • Compounds run together as pronounced (reflecting continuous speech)
  • Word-length distribution matches agglutinative spoken language

This "spoken, not written" framing resolves decades of failed cryptographic attacks: the manuscript was never a cipher, and classical decryption methods were searching for the wrong kind of structure.

Statistical Validation

Every test below can be reproduced by running the listed script. All Z-scores are computed against random decoders using the same architecture with randomized consonant mappings.
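
Each Z-score follows the usual standardization: score H12 and every random decoder on the same test statistic, then express H12's score in standard deviations above the random-decoder mean. A sketch (the helper name is ours):

```python
import math

def z_score(h12_score, random_scores):
    """Standardize H12's score against the random-decoder distribution."""
    n = len(random_scores)
    mean = sum(random_scores) / n
    var = sum((s - mean) ** 2 for s in random_scores) / n
    sd = math.sqrt(var)
    if sd == 0:
        raise ValueError("random decoders all scored identically")
    return (h12_score - mean) / sd
```

So "Z = 3.5" in the external pharma row means H12's matched-token count sits 3.5 standard deviations above the mean of the random-decoder counts for the same test.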

Adversarial Random-Decoder Tests

| Test | Script | Result | Significance |
|---|---|---|---|
| Panchavidha Kashaya Kalpana | validate_external_pharma.py | 6 decoded dosage-form terms, 2,070 tokens, 0/200 random match | Z = 10.5 (CIRCULAR — see disclosures) |
| SOV syntax (postpositional) | — (see sov_syntax_output.txt) | 78.0% postposition-after-noun | Z = 7.04 |
| SOV syntax (noun-before-verb) | — (see sov_syntax_output.txt) | 75.4% noun-before-verb | Z = 4.19 |
| External pharma vocabulary | validate_external_pharma.py | 7,211 tokens from 150 published terms, 0/38 beat H12 (controlled) | Z = 3.5 |
| Pharmaceutical collocations | — (see collocation_output.txt) | 16/36 Ayurvedic pairs, 0/10 random decoders produce any | 0/10 random |
| Fair cross-language pharma | structural_multilang_test.py | Sinhala 5,171 tokens, 13x next-best (equalized 115-concept vocab) | Z = 1.66 |
| Candidate language comparison | candidate_language_test.py | Sinhala #1 by Z-score (-0.34) and raw coverage (49.9%) | Least negative Z |
| Loanword isolation | loanword_isolation_test.py | Sinhala-only Z=1.39, Pali at 25% of Sinhala (consistent with loanwords) | Weak pass |
| Keyword-section clustering | keyword_section_clustering.py | Semantic profiles differ by section (chi2=2096, 1000 shuffles) | Z = 31.81 |

Structural Validation

| Test | Script | Result |
|---|---|---|
| Coverage | validate_coverage.py | 90.7% glossed, 99.3% total known, 0.7% unknown |
| Vowel-final constraint | validate_vowel_final.py | 99.95% vowel-final (Elu abugida) |
| Domain clustering | validate_domain_clustering.py | 46.1% medical vocabulary (9x over ~5% random baseline) |
| Phonotactics | validate_phonotactics.py | Sinhala #2-3 across CV-pattern measures |
| Decoder specificity | decoder_specificity_test.py | Sinhala least-negative delta Z (-0.34), all languages negative |
| Directionality flip | entropy_analysis.py | EVA RTL-optimized (PP ratio 0.899), decoded LTR-optimized (1.647) |
| Entropy h2 | entropy_analysis.py | EVA h2=2.358, decoded h2=2.276 (delta -0.082, not closer to natural) |

Honest Disclosures

| Finding | Details |
|---|---|
| All dictionary Z-scores negative | H12 matches fewer dictionary words than random decoders for ALL languages. Sinhala is least negative (-0.34). Random decoders explore more of the dictionary space. |
| Cross-language Z below 2.0 | Fair test Z=1.66. Raw advantage (13x) is clear but not statistically significant by this metric alone. |
| Generic CV hypothesis | Cannot be fully rejected by dictionary matching alone. The structural tests (Panchavidha, SOV, Collocations) are the discriminating evidence. |
| Decoded entropy unchanged | H12 decoded h2=2.276 vs EVA h2=2.358 (delta -0.082). Decoding does not move entropy closer to natural language (~3.3 bits). |
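
The h2 values quoted above are second-order character entropies. A minimal sketch of the computation, assuming h2 is the entropy (in bits) of a character conditioned on its predecessor; the corpus preprocessing in entropy_analysis.py may differ:

```python
import math
from collections import Counter

def h2(text):
    """Conditional entropy (bits) of a character given the previous character."""
    pairs = Counter(zip(text, text[1:]))   # bigram counts
    singles = Counter(text[:-1])           # counts of the conditioning character
    n = sum(pairs.values())
    ent = 0.0
    for (a, _b), count in pairs.items():
        p_pair = count / n                 # P(a, b)
        p_cond = count / singles[a]        # P(b | a)
        ent -= p_pair * math.log2(p_cond)
    return ent
```

A fully predictable sequence like `ababab...` gives h2 = 0; natural-language text typically lands near 3.3 bits, which is why the unchanged decoded value (2.276 vs 2.358) is listed as a disclosure rather than a success.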

Failed Tests (Documented for Transparency)

| Test | Script | Result | Why It Failed |
|---|---|---|---|
| Folio-section clustering | folio_pharma_clustering.py | Z = 0.2 | Entire manuscript is medical — no non-medical control sections |
| Recipe phase ordering | recipe_sequence_test.py | All 3 metrics failed | Matched vocabulary too generic to reveal recipe phases |

Cross-Modal Convergences (Inherently Random-Proof)

These involve two independent modalities (decoded text + illustrations) converging on the same conclusion. No random decoder can match illustrations.

  • Petersen botanical IDs: 3 plants independently identified by botanist and by H12 (tamala, kamala, tambula)
  • Rajas on women's folios: decoded "ra" (menstrual/female principle) concentrates in sections illustrating women
  • Recipe-illustration coherence: pharmaceutical sections with vessel illustrations decode as processing steps; herbal sections with plant illustrations decode as plant parts

Contributing / Requesting Additional Tests

We welcome adversarial testing. If you can think of a test that would falsify or strengthen this hypothesis, we want to hear it.

Ways to contribute:

  • Open an issue describing a test you'd like to see. We will implement and run it, and publish the results regardless of outcome (as we have with our failed tests).
  • Submit a PR with a new validation script. Follow the pattern of existing scripts: decode the corpus with H12, run the same test on 200 random decoders, report the Z-score.
  • Attempt hostile replication: clone the repo, run the scripts, report what you find.
  • Challenge a specific claim: if any number in this README doesn't match what you get when running the script, file a bug.

Tests we would particularly value from the community:

  • Independent Sinhala/Elu linguistic assessment of the decoded text
  • Additional cross-language discrimination with larger vocabularies
  • Statistical tests for hidden structure that we haven't thought of
  • Comparison against other proposed Voynich decipherments using the same random-decoder methodology

The full audit trail of all tests (passed and failed) is in results/structural_evidence_summary.txt and VALIDATION_LOG.md.

Citation

@article{Basra2026,
  author  = {Basra, Kameldip Singh},
  title   = {The Voynich Manuscript Deciphered: A Phonetic Transcription of Spoken Elu-Sinhala},
  year    = {2026},
  doi     = {10.5281/zenodo.18611029},
  url     = {https://doi.org/10.5281/zenodo.18611029}
}

License

This work is licensed under CC BY-NC 4.0 -- free for personal and academic research use. For commercial licensing, contact kameldipbasra@gmail.com. See LICENSE file.

Contact

kameldipbasra@gmail.com