jwalin-shah/README.md
Jwalin Shah — I measure what LLMs do. 3 scalars beat 71M parameters at 0.975 F1 vs 0.331 on transitive closure.

AI Systems Engineer — evaluation · grounded reasoning · on-device inference. Currently a Research Contributor at Sentient Arena (Cohort 0).

↗ portfolio · ↗ email · ↗ linkedin


Selected work

tensor-logic — working through Domingos (2025). A 3-scalar tensor-logic recurrence vs. a 71M-parameter MLP, same task. mean F1 0.975 vs 0.331 · biggest graph 1,532 nodes (sympy) · zero-shot to real Python imports. Honestly documented limits — parity remains unlearnable.
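A minimal sketch of the transitive-closure task itself, written as an einsum recurrence iterated to a fixed point. It assumes NumPy is available and is not the repository's code: the three learned scalars are not described here, so this version has no learned parameters at all, and the function names and the 4-node chain graph are purely illustrative.

```python
import numpy as np

def transitive_closure(adj, max_iters=None):
    """Iterate Path <- step(Edge + Path.Edge) until nothing changes."""
    edge = (np.asarray(adj) > 0).astype(np.int64)
    path = edge.copy()
    n = edge.shape[0]
    for _ in range(max_iters or n):
        # Join Path(i,k) with Edge(k,j) on k, then sum k out.
        joined = np.einsum("ik,kj->ij", path, edge)
        # Step function: any positive count becomes 1.
        new_path = ((edge + joined) > 0).astype(np.int64)
        if np.array_equal(new_path, path):
            break
        path = new_path
    return path

def f1_score(pred, truth):
    """F1 over the cells of two 0/1 adjacency matrices."""
    tp = int(np.sum((pred == 1) & (truth == 1)))
    fp = int(np.sum((pred == 1) & (truth == 0)))
    fn = int(np.sum((pred == 0) & (truth == 1)))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative chain graph 0 -> 1 -> 2 -> 3 (made-up data).
edges = np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
])
closure = transitive_closure(edges)
# Baseline that predicts only the direct edges, scored against the closure.
baseline_f1 = f1_score(edges, closure)
```

Scoring the direct edges against the closure shows why the metric is informative: a predictor that never composes edges keeps perfect precision but loses recall on every multi-hop pair.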

officeqa-arena — grounded financial QA, Sentient Arena. 184.5/246 (75.0%) · $1.71 total · 9 architectural generations. Headline finding: shell grep on raw TXT beat an 11GB SQLite + 10-component consensus pipeline. 48% of failures = wrong table/row/column extraction.
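To make the "grep on raw TXT" idea concrete, here is a hedged sketch of line-level retrieval over plain-text files using only the Python standard library. It is an illustration of the general approach, not the competition pipeline: `grep_lines`, the file names, and the figures in the demo files are all invented for this example.

```python
import re
import tempfile
from pathlib import Path

def grep_lines(root, pattern, suffix=".txt"):
    """Return (file name, line number, line) for each matching line, like `grep -rni`."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in sorted(Path(root).rglob(f"*{suffix}")):
        with path.open(encoding="utf-8", errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if regex.search(line):
                    hits.append((path.name, lineno, line.rstrip("\n")))
    return hits

# Demo on two throwaway files with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "report_a.txt").write_text(
        "Total revenue 2020: 1,250\nNet income 2020: 310\n", encoding="utf-8"
    )
    Path(tmp, "report_b.txt").write_text(
        "Operating costs 2020: 940\n", encoding="utf-8"
    )
    demo_hits = grep_lines(tmp, r"net income")
```

Each hit carries its file name and line number, so a downstream reasoning step can cite exactly which row of which document an answer came from.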

jarvis-ai-assistant — privacy-first iMessage assistant on an 8GB M2 Air. mean draft 0.42s · p95 1.15s · retrieval Hit@5 0.88 · hallucination gate 96.2% pass. MLX-native, zero cloud dependencies. Evaluated 37 model configs.
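For readers unfamiliar with the metrics quoted above, the following sketch shows one common way to compute Hit@k and a p95 latency. The definitions are assumptions (Hit@k as "at least one relevant item in the top k", p95 by the nearest-rank method), the repository may define them differently, and every value in the demo is made up.

```python
import math
from statistics import mean

def hit_at_k(ranked_lists, relevant_sets, k=5):
    """Fraction of queries with at least one relevant id in the top k results."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if set(ranked[:k]) & set(relevant)
    )
    return hits / len(ranked_lists)

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up retrieval results for three queries.
ranked = [
    ["m1", "m7", "m2", "m9", "m4", "m5"],  # relevant item at rank 2: hit
    ["m8", "m6", "m1", "m2", "m4", "m3"],  # relevant item at rank 6: miss
    ["m5", "m2", "m9", "m1", "m6", "m7"],  # relevant item at rank 5: hit
]
relevant = [{"m7"}, {"m3"}, {"m6"}]
demo_hit5 = hit_at_k(ranked, relevant, k=5)

# Made-up draft latencies in seconds.
latencies = [0.30, 0.35, 0.40, 0.45, 0.50, 0.38, 0.42, 0.41, 0.39, 1.20]
demo_mean = mean(latencies)
demo_p95 = percentile_nearest_rank(latencies, 95)
```

Reporting both the mean and p95 matters because a single slow draft barely moves the mean but sets the tail latency a user actually notices.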

openhuman — open-source agentic desktop assistant (contributor). GNU · macOS · Windows · Linux · 247★ · 36 forks. Local-first KB (Neocortex), background self-learning loops (Subconscious), screen intelligence, inline autocomplete + voice, all on device.


Languages by bytes across 19 public repos: Python 80%, TypeScript 9%, JavaScript 3%, Svelte 3%, Shell 3%, Other 2%. 26-week commit heatmap.

Background

Sentient Arena · Research Contributor (Cohort 0): grounded financial reasoning · eval infra · failure-mode analysis
Skild AI · Data Operations Lead: robotics data systems · 5 platforms · 25+ operators · task success +40%, overhead −50%

Focus

grounded LLM reasoning · evaluation harnesses · deterministic computation · tool-augmented agents · hallucination measurement · on-device inference (MLX) · privacy-first architectures

Reach me

best for research collabs, eval & reliability work, on-device AI. ✉️ jwalinshah13@gmail.com · 💼 linkedin · 🌐 portfolio

Pinned

  1. tinyhumansai/openhuman Public

    Your Personal AI super intelligence. Private, Simple and extremely powerful.

    Rust · 347 stars · 50 forks

  2. tensor-logic Public

    Working through Pedro Domingos' tensor logic paper — runnable demos walking from one einsum to continual learning.

    Python

  3. officeqa-arena Public

    Competition entry: end-to-end OfficeQA pipeline for Sentient Arena, covering retrieval, ledger extraction, and LLM reasoning over 10k+ financial documents

    Python · 3 stars · 1 fork

  4. inbox Public

    Unified inbox TUI: iMessage, Gmail, Calendar, Drive, Notes, Reminders, GitHub — local FastAPI server + Textual TUI

    Python

  5. jarvis-ai-assistant Public

    Privacy-first local AI assistant for macOS (MLX + sqlite-vec) with 0.42s mean latency and <5% hallucination rate

    Python · 2 stars · 1 fork

  6. data-connect-framework Public

    Privacy-first personal data engine — schema-driven ingestion, canonical entity modeling, and local-first data pipelines

    Python