The Voynich Manuscript deciphered as a phonetic transcription of spoken Elu-Sinhala. Includes decoder, 4,591-entry vocabulary, and statistical validation.

The Voynich Manuscript Deciphered: A Phonetic Transcription of Spoken Elu-Sinhala

Kameldip Singh Basra | February 2026

Summary

The Voynich Manuscript (Beinecke MS 408, carbon-dated 1404-1438 CE) is identified as a 15th-century Elu-Sinhala pharmaceutical text -- a teaching manual recording a physician's spoken instructions for Ayurvedic preparations. The writing system is a bespoke abugida mapping 27 EVA characters to 14 Sinhala phonemes.

Key results:

  • 90.7% of 35,916 tokens glossable in English (32,572 tokens)
  • 99.3% matching Sinhala dictionary (Tier 1+2+3)
  • Only 0.7% truly unknown (236 tokens)
  • Keyword-section clustering: decoded semantic profiles differ by manuscript section (Z=31.81, 0/1000 shuffles), converging with Montemurro & Zanette (2013) from independent direction
  • Text-image convergence: decoded plant labels independently match Petersen's botanical illustration IDs
  • Directionality flip: EVA is RTL-optimized; decoded text is LTR-optimized — consistent with abugida encoding (Parisel 2025)
  • SOV syntax consistent across all manuscript sections (Z=7.04 postpositional, Z=4.19 noun-before-verb; Greenberg 1963 Universal 4)
  • Pharmaceutical collocations: 16/36 Ayurvedic pairs confirmed, 0/10 random decoders produce any
  • External vocabulary validation: Z=3.5 against 150 independently-sourced pharmaceutical terms (non-circular)
  • Fair cross-language test: Sinhala matches 13x more pharmaceutical terms than any other language (equalized vocabulary, 115 concepts, 6 languages)
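
The keyword-section clustering result above is a label-permutation test: compute a chi-square statistic over the section-by-keyword contingency table, then shuffle the section labels and count how often chance reaches the observed value. A minimal self-contained sketch (function names and toy data are ours, not the repository's keyword_section_clustering.py):

```python
import random

def chi2_stat(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n
            stat += (obs - exp) ** 2 / exp
    return stat

def permutation_test(section_labels, keyword_labels, n_shuffles=1000, seed=0):
    """Shuffle section labels; count shuffles whose chi2 >= the observed chi2."""
    rng = random.Random(seed)
    sections = sorted(set(section_labels))
    keywords = sorted(set(keyword_labels))

    def tabulate(labels):
        table = [[0] * len(keywords) for _ in sections]
        for s, k in zip(labels, keyword_labels):
            table[sections.index(s)][keywords.index(k)] += 1
        return table

    observed = chi2_stat(tabulate(section_labels))
    shuffled = list(section_labels)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if chi2_stat(tabulate(shuffled)) >= observed:
            hits += 1
    return observed, hits
```

With strongly clustered toy data (e.g. plant keywords occurring only in the herbal section), `hits` stays at or near zero, which is the "0/1000 shuffles" reading of the result above.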

Validation Philosophy

This work follows an adversarial self-testing methodology. Every structural claim is tested against random decoders using the same decoder architecture with randomized consonant mappings. Tests that failed are documented alongside tests that passed. All scripts are reproducible from the public data.
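
The random-decoder control described here can be sketched as follows. The character inventories below are illustrative placeholders, not the actual H12 tables; the point is that each control decoder keeps the architecture fixed and permutes only the consonant mapping:

```python
import random

# Illustrative inventories only -- NOT the real H12 character tables.
EVA_CONSONANTS = ["d", "k", "t", "sh", "ch", "s", "l", "r", "m", "n"]
SINHALA_CONSONANTS = ["g", "k", "t", "m", "c", "s", "l", "r", "m", "n"]

def random_consonant_mapping(seed):
    """One random decoder: the same targets, assigned in a random permutation."""
    rng = random.Random(seed)
    targets = SINHALA_CONSONANTS[:]
    rng.shuffle(targets)
    return dict(zip(EVA_CONSONANTS, targets))

# A family of 200 control decoders, as in the validation scripts:
random_decoders = [random_consonant_mapping(seed) for seed in range(200)]
```

Every structural test is then run once with the real H12 mapping and once with each of these controls, so any advantage H12 shows cannot come from the decoder architecture itself.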

Honest disclosures:

  • Panchavidha circularity: The Panchavidha Z=10.5 is partially circular -- H12-decoded terms tested against H12 output. 3/6 terms are not in the Sinhala dictionary. 0/28 standard Sinhala preparation terms match H12 output. The non-circular version is the external pharma test (Z=3.5, 150 independently-sourced terms).
  • All raw dictionary Z-scores are negative (H12 matches fewer dictionary words than random decoders). The structural tests (SOV, Collocations, External Pharma) provide the stronger evidence because they test whether matched words form coherent medical text, not just raw counts.
  • The cross-language pharmaceutical Z-score is 1.66, below the conventional 2.0 threshold for statistical significance, though the raw advantage is 13x.
  • Two tests failed (folio clustering, recipe sequencing) and are documented below.

We actively invite additional tests. If there is a test you believe would strengthen or weaken this hypothesis, please open an issue or submit a PR. See Contributing below.

Repository Structure

├── main.tex                              # Full paper (LaTeX)
├── main.pdf                              # Compiled paper (PDF)
├── paper.md                              # Full paper (Markdown)
├── references.bib                        # Bibliography
├── VALIDATION_LOG.md                     # Chronological adversarial test log
├── data/
│   ├── voynich_eva_transcription.txt     # EVA corpus (Takahashi transcription)
│   ├── decoded_vocabulary.tsv            # 8,493-entry decoded vocabulary
│   ├── sinhala_dictionary.txt            # 1.47M romanized Sinhala dictionary
│   ├── external_pharmaceutical_vocab.tsv # 150 independently-sourced pharma terms
│   ├── crosslang_pharmaceutical_vocab.tsv # 60 concepts x 5 languages
│   ├── equalized_pharmaceutical_vocab.tsv # 115 concepts x 6 languages (fair test)
│   ├── hebrew_wordlist.txt               # Hebrew wordlist
│   ├── tamil_wordlist.txt                # Tamil wordlist
│   ├── hindi_wordlist.txt                # Hindi wordlist
│   ├── turkish_wordlist.txt              # Turkish wordlist
│   ├── latin_wordlist.txt                # Latin wordlist
│   └── arabic_wordlist.txt               # Arabic wordlist
├── scripts/
│   ├── h12_decoder.py                    # Core H12 decoder
│   ├── validate_coverage.py              # 4-tier coverage validation
│   ├── validate_vowel_final.py           # Vowel-final constraint test
│   ├── validate_domain_clustering.py     # Domain clustering analysis
│   ├── validate_phonotactics.py          # Cross-language phonotactic test
│   ├── validate_external_pharma.py       # External vocabulary + Panchavidha random comparison
│   ├── crosslang_discrimination_test.py  # Cross-language discrimination (5 Indic languages)
│   ├── structural_multilang_test.py      # Fair cross-language test (6 languages, equalized vocab)
│   ├── candidate_language_test.py        # Full candidate language comparison (7 languages)
│   ├── decoder_specificity_test.py       # Decoder specificity analysis
│   ├── loanword_isolation_test.py        # Sinhala vs Pali loanword isolation
│   ├── folio_pharma_clustering.py        # Folio-section clustering (FAILED — documented)
│   ├── recipe_sequence_test.py           # Recipe phase ordering (FAILED — documented)
│   ├── keyword_section_clustering.py     # Keyword-section clustering (Montemurro replication)
│   ├── entropy_analysis.py               # Entropy h2 and directionality analysis
│   ├── translate_manuscript.py           # Full manuscript translation
│   └── herbal_plant_guide.py             # Herbal folio plant identification
├── results/
│   ├── structural_evidence_summary.txt   # Master audit of all structural evidence
│   ├── structural_multilang.txt          # Fair cross-language test results
│   ├── panchavidha_validation.txt        # Panchavidha random comparison (Z=10.5)
│   ├── external_pharma_validation.txt    # External pharma validation (Z=3.5 controlled)
│   ├── crosslang_discrimination.txt      # Cross-language discrimination results
│   ├── candidate_language_validation.txt # Full candidate language comparison
│   ├── decoder_specificity.txt           # Decoder specificity analysis
│   ├── loanword_isolation.txt            # Loanword isolation results
│   ├── keyword_section_clustering.txt    # Keyword-section clustering results
│   └── entropy_directionality_analysis.txt # Entropy and directionality analysis
└── translation/
    ├── voynich_translation.md            # Full English translation
    └── herbal_plant_guide.md             # Plant identification guide

Requirements

  • Python 3.7+
  • NumPy (pip install numpy) — only required for validate_phonotactics.py

All other scripts use only the Python standard library.

Quick Start

# Decode a single word
python scripts/h12_decoder.py --word daiin
# Output: gena (Sinhala: "having taken")

# Decode the full corpus
python scripts/h12_decoder.py --input data/voynich_eva_transcription.txt --summary

# Run core validation scripts
python scripts/validate_coverage.py              # Tier 1: 90.7%, Total known: 99.3%
python scripts/validate_vowel_final.py           # 99.95% vowel-final
python scripts/validate_domain_clustering.py     # 46% medical (9x over random baseline)
python scripts/validate_phonotactics.py          # Sinhala ranks #2-3 across measures

# Run adversarial random-decoder comparisons
python scripts/validate_external_pharma.py       # Z=3.5 controlled, Panchavidha Z=10.5
python scripts/structural_multilang_test.py      # Fair cross-language: Sinhala 13x (Z=1.66)
python scripts/candidate_language_test.py        # Sinhala #1 by Z-score and raw coverage
python scripts/decoder_specificity_test.py       # H12 specificity analysis
python scripts/loanword_isolation_test.py        # Sinhala vs Pali discrimination

# Run literature-derived analyses
python scripts/keyword_section_clustering.py     # Keyword-section clustering (Z=31.81)
python scripts/entropy_analysis.py               # Entropy h2 and directionality

The H12 Decoder

The H12 decoder maps EVA (Extensible Voynich Alphabet) characters to Sinhala phonemes via abugida rules. The system uses a 14-phoneme inventory that corresponds to the spoken Elu dialect of medieval Sinhala.

Key mappings include:

  • sh maps to m
  • o maps to u
  • d (onset) maps to g
  • k (medial) maps to g
  • ch + consonant triggers devoicing
  • ct maps to th (aspirated dental)
  • ck maps to kh (aspirated velar)
  • cp maps to ph (aspirated labial)
  • q / h are silent (null carriers)

The decoder applies positional rules (onset vs. medial vs. coda) and abugida vowel-handling conventions to produce phonetic Sinhala output from EVA input.
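
A minimal sketch of these positional rules, simplified for illustration: it covers only the mappings listed above and omits abugida vowel handling and the ch-devoicing rule, so it produces a consonant-skeleton approximation rather than full h12_decoder.py output.

```python
# Digraphs from the mapping list above: aspirated stops, plus sh -> m.
DIGRAPHS = {"ct": "th", "ck": "kh", "cp": "ph", "sh": "m"}

def decode_token(eva):
    """Simplified positional decoding of one EVA token (sketch only)."""
    out = []
    i = 0
    while i < len(eva):
        pair = eva[i:i + 2]
        ch = eva[i]
        if pair in DIGRAPHS:                      # ct/ck/cp aspirates, sh -> m
            out.append(DIGRAPHS[pair])
            i += 2
        elif ch in "qh":                          # q / h are silent carriers
            i += 1
        elif ch == "o":                           # o -> u
            out.append("u")
            i += 1
        elif ch == "d" and i == 0:                # d in onset position -> g
            out.append("g")
            i += 1
        elif ch == "k" and 0 < i < len(eva) - 1:  # medial k -> g
            out.append("g")
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For example, this sketch turns `sho` into `mu` and `cto` into `thu`; the real decoder additionally inserts inherent vowels per the abugida conventions.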

The Key Insight: Spoken, Not Written

The manuscript encodes spoken language -- a phonetic transcription of medieval speech -- not written language in cipher. This explains why:

  • The 14-phoneme inventory matches pre-12th-century spoken Elu (no /b/, /v/, /f/)
  • N-gram frequencies match spoken Sinhala (#1 when spoken-weighted)
  • Edit-distance-1 matches are pronunciation variants, not errors
  • Compounds run together as pronounced (reflecting continuous speech)
  • Word-length distribution matches agglutinative spoken language

This "spoken, not written" framing resolves decades of failed cryptographic attacks: the manuscript was never a cipher, and classical decryption methods were searching for the wrong kind of structure.

Statistical Validation

Every test below can be reproduced by running the listed script. All Z-scores are computed against random decoders using the same architecture with randomized consonant mappings.
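
Each Z-score follows the usual standardization: score H12 and every random decoder on the same test statistic, then express H12's score in standard deviations above the random-decoder mean. A sketch (the helper name is ours):

```python
import math

def z_score(h12_score, random_scores):
    """Standardize H12's score against the random-decoder distribution."""
    n = len(random_scores)
    mean = sum(random_scores) / n
    var = sum((s - mean) ** 2 for s in random_scores) / n
    sd = math.sqrt(var)
    if sd == 0:
        raise ValueError("random decoders all scored identically")
    return (h12_score - mean) / sd
```

So "Z = 3.5" in the external pharma row means H12's matched-token count sits 3.5 standard deviations above the mean of the random-decoder counts for the same test.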

Adversarial Random-Decoder Tests

| Test | Script | Result | Significance |
|---|---|---|---|
| Panchavidha Kashaya Kalpana | validate_external_pharma.py | 6 decoded dosage-form terms, 2,070 tokens, 0/200 random match | Z = 10.5 (CIRCULAR — see disclosures) |
| SOV syntax (postpositional) | — (see sov_syntax_output.txt) | 78.0% postposition-after-noun | Z = 7.04 |
| SOV syntax (noun-before-verb) | — (see sov_syntax_output.txt) | 75.4% noun-before-verb | Z = 4.19 |
| External pharma vocabulary | validate_external_pharma.py | 7,211 tokens from 150 published terms, 0/38 beat H12 (controlled) | Z = 3.5 |
| Pharmaceutical collocations | — (see collocation_output.txt) | 16/36 Ayurvedic pairs, 0/10 random decoders produce any | 0/10 random |
| Fair cross-language pharma | structural_multilang_test.py | Sinhala 5,171 tokens, 13x next-best (equalized 115-concept vocab) | Z = 1.66 |
| Candidate language comparison | candidate_language_test.py | Sinhala #1 by Z-score (-0.34) and raw coverage (49.9%) | Least negative Z |
| Loanword isolation | loanword_isolation_test.py | Sinhala-only Z=1.39, Pali at 25% of Sinhala (consistent with loanwords) | Weak pass |
| Keyword-section clustering | keyword_section_clustering.py | Semantic profiles differ by section (chi2=2096, 1000 shuffles) | Z = 31.81 |

Structural Validation

| Test | Script | Result |
|---|---|---|
| Coverage | validate_coverage.py | 90.7% glossed, 99.3% total known, 0.7% unknown |
| Vowel-final constraint | validate_vowel_final.py | 99.95% vowel-final (Elu abugida) |
| Domain clustering | validate_domain_clustering.py | 46.1% medical vocabulary (9x over ~5% random baseline) |
| Phonotactics | validate_phonotactics.py | Sinhala #2-3 across CV-pattern measures |
| Decoder specificity | decoder_specificity_test.py | Sinhala least-negative delta Z (-0.34), all languages negative |
| Directionality flip | entropy_analysis.py | EVA RTL-optimized (PP ratio 0.899), decoded LTR-optimized (1.647) |
| Entropy h2 | entropy_analysis.py | EVA h2=2.358, decoded h2=2.276 (delta -0.082, not closer to natural) |

Honest Disclosures

| Finding | Details |
|---|---|
| All dictionary Z-scores negative | H12 matches fewer dictionary words than random decoders for ALL languages. Sinhala is least negative (-0.34). Random decoders explore more of the dictionary space. |
| Cross-language Z below 2.0 | Fair test Z=1.66. Raw advantage (13x) is clear but not statistically significant by this metric alone. |
| Generic CV hypothesis | Cannot be fully rejected by dictionary matching alone. The structural tests (Panchavidha, SOV, Collocations) are the discriminating evidence. |
| Decoded entropy unchanged | H12 decoded h2=2.276 vs EVA h2=2.358 (delta -0.082). Decoding does not move entropy closer to natural language (~3.3 bits). |
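
The h2 values quoted above are second-order character entropies. A minimal sketch of the computation, assuming h2 is the entropy (in bits) of a character conditioned on its predecessor; the corpus preprocessing in entropy_analysis.py may differ:

```python
import math
from collections import Counter

def h2(text):
    """Conditional entropy (bits) of a character given the previous character."""
    pairs = Counter(zip(text, text[1:]))   # bigram counts
    singles = Counter(text[:-1])           # counts of the conditioning character
    n = sum(pairs.values())
    ent = 0.0
    for (a, _b), count in pairs.items():
        p_pair = count / n                 # P(a, b)
        p_cond = count / singles[a]        # P(b | a)
        ent -= p_pair * math.log2(p_cond)
    return ent
```

A fully predictable sequence like `ababab...` gives h2 = 0; natural-language text typically lands near 3.3 bits, which is why the unchanged decoded value (2.276 vs 2.358) is listed as a disclosure rather than a success.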

Failed Tests (Documented for Transparency)

| Test | Script | Result | Why It Failed |
|---|---|---|---|
| Folio-section clustering | folio_pharma_clustering.py | Z = 0.2 | Entire manuscript is medical — no non-medical control sections |
| Recipe phase ordering | recipe_sequence_test.py | All 3 metrics failed | Matched vocabulary too generic to reveal recipe phases |

Cross-Modal Convergences (Inherently Random-Proof)

These involve two independent modalities (decoded text + illustrations) converging on the same conclusion. No random decoder can match illustrations.

  • Petersen botanical IDs: 3 plants independently identified by botanist and by H12 (tamala, kamala, tambula)
  • Rajas on women's folios: decoded "ra" (menstrual/female principle) concentrates in sections illustrating women
  • Recipe-illustration coherence: pharmaceutical sections with vessel illustrations decode as processing steps; herbal sections with plant illustrations decode as plant parts

Contributing / Requesting Additional Tests

We welcome adversarial testing. If you can think of a test that would falsify or strengthen this hypothesis, we want to hear it.

Ways to contribute:

  • Open an issue describing a test you'd like to see. We will implement and run it, and publish the results regardless of outcome (as we have with our failed tests).
  • Submit a PR with a new validation script. Follow the pattern of existing scripts: decode the corpus with H12, run the same test on 200 random decoders, report the Z-score.
  • Attempt hostile replication: clone the repo, run the scripts, report what you find.
  • Challenge a specific claim: if any number in this README doesn't match what you get when running the script, file a bug.

Tests we would particularly value from the community:

  • Independent Sinhala/Elu linguistic assessment of the decoded text
  • Additional cross-language discrimination with larger vocabularies
  • Statistical tests for hidden structure that we haven't thought of
  • Comparison against other proposed Voynich decipherments using the same random-decoder methodology

The full audit trail of all tests (passed and failed) is in results/structural_evidence_summary.txt and VALIDATION_LOG.md.

Citation

@article{Basra2026,
  author  = {Basra, Kameldip Singh},
  title   = {The Voynich Manuscript Deciphered: A Phonetic Transcription of Spoken Elu-Sinhala},
  year    = {2026},
  doi     = {10.5281/zenodo.18611029},
  url     = {https://doi.org/10.5281/zenodo.18611029}
}

License

This work is licensed under CC BY-NC 4.0 -- free for personal and academic research use. For commercial licensing, contact kameldipbasra@gmail.com. See LICENSE file.

Contact

kameldipbasra@gmail.com