CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Monorepo with two independent subprojects:

  • expo-app/ — React Native (Expo, TypeScript) app that lets users input Spanish words, groups them by semantic similarity, and finds related "secret code" words using FastText embeddings.
  • asset-parser/ — Python scripts that download the Spanish word list from OpenSLR, extract the top 50K words, generate 300-dimensional FastText embeddings, and copy the result into expo-app's assets.

Build & Run Commands

asset-parser (Python)

cd asset-parser
python -m venv venv && source venv/bin/activate  # first time
pip install -r requirements.txt                   # first time
python main.py                                    # download + generate top_spanish_words.json
python generate_embeddings.py                     # generate embeddings_N.json chunks + copy to expo-app/assets/
python generate_hypernyms.py                      # generate hypernyms.json + copy to expo-app/assets/
python generate_cultural_relations.py              # download ConceptNet + generate cultural_relations.json + copy to expo-app/assets/

get_words.py produces top_spanish_words.json, which both generate_embeddings.py and generate_hypernyms.py consume.

expo-app (Expo / React Native)

cd expo-app
npm install            # install dependencies
npx expo start         # start dev server
npx expo start --ios   # run on iOS simulator
npx expo start --android  # run on Android emulator

Requires embeddings_*.json files in expo-app/assets/ — generated by the asset-parser pipeline above.

Architecture

The data pipeline flows in one direction:

  1. asset-parser/get_words.py downloads the OpenSLR Spanish word list and keeps the top 50K words.
  2. asset-parser/generate_embeddings.py loads a FastText model (cc.es.300.bin), generates 300-dim embeddings (rounded to 4 decimals), splits into chunks if needed (max 100 MB each), and copies them to expo-app/assets/embeddings_N.bin.
  3. asset-parser/generate_hypernyms.py uses NLTK WordNet (Open Multilingual Wordnet for Spanish) to find hypernyms for each word and copies the result to expo-app/assets/hypernyms.json.
  4. asset-parser/generate_cultural_relations.py downloads ConceptNet 5.7 CSV, filters Spanish-only edges, and builds a bidirectional relation map copied to expo-app/assets/cultural_relations.json.
  5. The Expo app loads embeddings_0.bin and hypernyms.json via require() at runtime (cached after first load). With 50K words the embeddings fit in a single chunk.
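
As a rough illustration of the rounding and size-based splitting described in step 2, here is a minimal sketch. It is not the code in generate_embeddings.py: the function names, the JSON-based size estimate, and the reading of "100 MB" as 100 * 1024 * 1024 bytes are all assumptions made for the example.

```python
import json

# Assumed interpretation of "max 100 MB each"; the real generator may differ.
MAX_CHUNK_BYTES = 100 * 1024 * 1024


def round_vector(vector, decimals=4):
    """Round every component, mirroring the 4-decimal rounding in step 2."""
    return [round(float(x), decimals) for x in vector]


def split_into_chunks(embeddings, max_bytes=MAX_CHUNK_BYTES):
    """Greedily pack word -> vector entries into chunks of at most max_bytes.

    Size is estimated from each entry's compact JSON encoding (hypothetical
    choice). An entry larger than max_bytes still gets a chunk of its own.
    """
    chunks, current, current_size = [], {}, 2  # 2 bytes for the enclosing "{}"
    for word, vector in embeddings.items():
        entry = json.dumps({word: vector}, ensure_ascii=False, separators=(",", ":"))
        entry_size = len(entry.encode("utf-8")) - 2 + 1  # drop braces, add a comma
        if current and current_size + entry_size > max_bytes:
            chunks.append(current)
            current, current_size = {}, 2
        current[word] = vector
        current_size += entry_size
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would then be written to its own numbered file, with the chunk's position in the list as N.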

Expo app structure

  • App.tsx — main UI: word input, search trigger, results display with "Not convinced" pagination.
  • src/embeddings.ts — loads the single embeddings file and caches it.
  • src/hypernyms.ts — loads and caches the hypernyms JSON.
  • src/search.ts — cosine similarity, single-linkage word grouping, and brute-force related word search (top 200 per group, displayed 5 at a time).
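
The three techniques named for src/search.ts can be sketched as follows. This is written in Python for brevity and is not a port of the TypeScript source: the similarity threshold, the scoring rule (mean similarity to the group's members), and all function names are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_words(vectors, threshold):
    """Single-linkage grouping: two words share a group when a chain of
    pairwise similarities >= threshold connects them (union-find)."""
    words = list(vectors)
    parent = {w: w for w in words}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if cosine_similarity(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    return list(groups.values())


def related_words(group, vectors, vocabulary, limit=200):
    """Brute force: score every vocabulary word by its mean similarity to the
    group's members (assumed scoring rule) and keep the best `limit`."""
    scored = []
    for word, vec in vocabulary.items():
        if word in group:
            continue
        score = sum(cosine_similarity(vec, vectors[g]) for g in group) / len(group)
        scored.append((score, word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:limit]]


def page(results, page_index, page_size=5):
    """Return one page of results, e.g. the next 5 after "Not convinced"."""
    start = page_index * page_size
    return results[start:start + page_size]
```

Single linkage merges groups through any one sufficiently similar pair, so a long chain of loosely related words can end up in the same group; a stricter threshold limits that effect.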

Key constraints

  • With 50K words the embeddings fit in a single chunk (embeddings_0.bin). If the word count grows and the output exceeds 100 MB, the generator splits it into multiple chunks; in that case, update src/embeddings.ts and app.json to load all chunks.
  • embeddings_*.json and hypernyms.json are gitignored in both asset-parser/ and expo-app/assets/. The FastText model files (cc.es.300.bin, cc.es.300.bin.gz) are also gitignored.
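
Because the generated assets are gitignored and the chunk count can change, a small preflight check can catch both problems before the app starts. The helper below is hypothetical and not part of the repository; it matches any file named embeddings_* regardless of extension and assumes the default path is relative to the repository root.

```python
from pathlib import Path


def find_embedding_chunks(assets_dir):
    """Return generated embedding chunk files, sorted by numeric chunk index."""
    chunks = [p for p in Path(assets_dir).glob("embeddings_*") if p.is_file()]

    def index(path):
        suffix = path.stem.split("_", 1)[1]
        return int(suffix) if suffix.isdigit() else -1

    return sorted(chunks, key=index)


def check_assets(assets_dir="expo-app/assets"):
    """Summarize whether the embeddings assets match what the app expects."""
    chunks = find_embedding_chunks(assets_dir)
    if not chunks:
        return "missing: run the asset-parser pipeline first"
    if len(chunks) > 1:
        return (
            f"{len(chunks)} chunks found: update src/embeddings.ts "
            "and app.json to load all of them"
        )
    return f"ok: {chunks[0].name}"


if __name__ == "__main__":
    print(check_assets())
```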