feat: audio ingestion via Whisper API#31

Open
RaviTharuma wants to merge 3 commits into AVIDS2:main from RaviTharuma:feature/memorix-ek1-audio-ingestion

Conversation

@RaviTharuma
Contributor

Summary

Adds a memorix_ingest_audio MCP tool that transcribes audio files via the Whisper API and stores the transcript as a memorix observation.

What's included

  • src/multimodal/audio-loader.ts — Whisper API client with:
    • Configurable model (MEMORIX_WHISPER_MODEL, default: whisper-1)
    • Language hint and prompt support
    • Chunked file reading for large files
    • Automatic format detection from extension (mp3, wav, m4a, ogg, flac, webm, mp4)
  • src/multimodal/index.ts — shared multimodal utilities (MIME detection, file validation)
  • src/server.ts — MCP tool registration (memorix_ingest_audio)
  • tests/multimodal/audio-loader.test.ts — 8 tests covering:
    • Transcript extraction and observation storage
    • Unsupported format rejection
    • Missing file handling
    • Language and prompt passthrough
    • API error propagation
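
The extension-based format detection can be sketched roughly as below; the names used here (detectAudioMime, SUPPORTED_AUDIO) are illustrative, not necessarily those in src/multimodal/audio-loader.ts:

```typescript
// Hypothetical sketch of extension-based audio format detection.
// Maps the supported extensions listed above to MIME types and rejects
// anything else, matching the "unsupported format rejection" test case.
const SUPPORTED_AUDIO: Record<string, string> = {
  mp3: "audio/mpeg",
  wav: "audio/wav",
  m4a: "audio/mp4",
  ogg: "audio/ogg",
  flac: "audio/flac",
  webm: "audio/webm",
  mp4: "audio/mp4",
};

function detectAudioMime(filePath: string): string {
  const ext = filePath.split(".").pop()?.toLowerCase() ?? "";
  const mime = SUPPORTED_AUDIO[ext];
  if (!mime) {
    throw new Error(`Unsupported audio format: .${ext}`);
  }
  return mime;
}
```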

Design decisions

  • LLM-optional: requires MEMORIX_WHISPER_API_KEY (falls back to OPENAI_API_KEY). The tool returns a clear error if unconfigured.
  • Stores transcript as how-it-works observation type with audio metadata in facts.
  • No new npm dependencies — uses native fetch for API calls and fs for file reading.
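
For illustration, the native-fetch call to a Whisper-style /audio/transcriptions endpoint might be assembled as follows; the helper name and its signature are assumptions, not the PR's actual implementation (the real loader also reads large files in chunks, which is omitted here):

```typescript
// Sketch: build the multipart request for OpenAI's /v1/audio/transcriptions
// endpoint using only globals available in Node 18+ (FormData, Blob).
// The request is returned rather than sent, so no network access is needed.
function buildTranscriptionRequest(
  apiKey: string,
  audio: Blob,
  fileName: string,
  opts: { model?: string; language?: string; prompt?: string } = {},
): { url: string; init: { method: string; headers: Record<string, string>; body: FormData } } {
  const form = new FormData();
  form.append("file", audio, fileName);
  form.append("model", opts.model ?? "whisper-1"); // MEMORIX_WHISPER_MODEL default
  if (opts.language) form.append("language", opts.language); // language hint passthrough
  if (opts.prompt) form.append("prompt", opts.prompt); // prompt passthrough
  return {
    url: "https://api.openai.com/v1/audio/transcriptions",
    init: {
      method: "POST",
      headers: { Authorization: `Bearer ${apiKey}` },
      body: form,
    },
  };
}
```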

Tests

8 pass, 0 fail


@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6e1369a556


src/server.ts Outdated
Comment on lines +3078 to +3079
markInternalWrite();
const result = await ingestAudio(


P2 Badge Move internal-write marker to the actual write point

markInternalWrite() is called before ingestAudio(), but audio ingestion performs a remote transcription that can run up to 120 seconds; the hot-reload skip window is only 10 seconds, so long transcriptions can expire the skip window before storeObservation writes. In that case the file watcher will treat this internal write as external and trigger unnecessary reload/reindex work, which can cause avoidable contention during ingest-heavy workflows.
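
One way to address this is to mark the internal write immediately before the observation is persisted rather than before the long-running transcription. The sketch below uses stand-in signatures for markInternalWrite/storeObservation/ingestAudio; the real code in src/server.ts may differ:

```typescript
// Sketch: defer markInternalWrite() to the actual write point by passing
// a store callback into ingestAudio. The transcription can then take its
// full 120 seconds without the 10-second hot-reload skip window expiring
// before storeObservation runs.
async function ingestAudioTool(
  filePath: string,
  ingestAudio: (path: string, store: (text: string) => Promise<void>) => Promise<string>,
  markInternalWrite: () => void,
  storeObservation: (text: string) => Promise<void>,
): Promise<string> {
  return ingestAudio(filePath, async (transcript) => {
    markInternalWrite(); // marked right before the write, inside the skip window
    await storeObservation(transcript);
  });
}
```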


@@ -0,0 +1,173 @@
import { describe, it, expect, afterEach, beforeEach } from 'bun:test';


P1 Badge Use Vitest APIs in the new test file

The repository’s test command runs Vitest and includes tests/**/*.test.ts, but this new test imports from bun:test. Under the declared test runner, that import is unresolved, so CI/local npm test will fail before executing these assertions. Switching this file to Vitest imports (or globals) is necessary to keep the test suite runnable.
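
Concretely, one possible minimal change is swapping the import at the top of the test file (assuming Vitest is not configured with globals):

```diff
-import { describe, it, expect, afterEach, beforeEach } from 'bun:test';
+import { describe, it, expect, afterEach, beforeEach } from 'vitest';
```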


@AVIDS2
Owner

AVIDS2 commented Mar 30, 2026

Thanks for pushing on multimodal ingestion — the capability is useful, but this one still has a couple of concrete blockers.

  1. The test file uses bun:test, while this repo runs vitest in CI. Right now that means the PR fails immediately in the standard test matrix.
  2. In src/server.ts, the object passed into storeObservation is inferred as type: string, but storeObservation expects a real ObservationType.
  3. More importantly, the config path is too loose for the implementation: transcribeAudio() pulls the generic Memorix LLM API key via getLLMApiKey(), but the code then talks specifically to OpenAI/Groq Whisper transcription endpoints. In a valid Memorix setup that is configured for Anthropic or another non-Whisper provider, this can end up sending the wrong credential/provider combination to an incompatible API.

So I don't think this is just a test-fix PR — it needs one more pass on provider/config semantics as well.

@RaviTharuma
Contributor Author

Thanks for the detailed breakdown — all three points are valid.

  1. bun:test → vitest: Will switch the test file imports to vitest globals.
  2. ObservationType inference: Will add an explicit cast / assertion so the object flowing into storeObservation satisfies the ObservationType constraint rather than widening to string.
  3. Provider/config semantics: This is the substantive one. The right fix is to add a Whisper-specific config path (e.g. MEMORIX_WHISPER_API_KEY / MEMORIX_WHISPER_BASE_URL) that is separate from the general LLM key, so audio ingestion doesn't assume the general LLM provider is Whisper-compatible. I'll update transcribeAudio() to use those dedicated config values and document them in the README section for audio ingestion.

…hisper config

Three changes:

1. storeFn parameter type: type: string -> type: ObservationType so the
   callback signature is compatible with storeObservation() without a cast.

2. Dedicated Whisper key resolution: transcribeAudio() now checks
   MEMORIX_WHISPER_API_KEY first, then falls back to OPENAI_API_KEY.
   Does NOT fall through to MEMORIX_LLM_API_KEY — that key may point
   to Anthropic or another non-Whisper provider, which would send the
   wrong credential to the /audio/transcriptions endpoint.

3. MEMORIX_WHISPER_BASE_URL: optional override for self-hosted or
   alternative Whisper-compatible endpoints (vLLM, LocalAI, etc.).

Tests: added coverage for MEMORIX_WHISPER_API_KEY priority and
MEMORIX_WHISPER_BASE_URL custom endpoint; cleanup ensures both vars
are cleared in afterEach.
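
The key-resolution behavior described in points 2–3 can be sketched as below; the helper name resolveWhisperConfig is illustrative:

```typescript
// Sketch of dedicated Whisper credential/endpoint resolution.
// MEMORIX_WHISPER_API_KEY wins; OPENAI_API_KEY is the only fallback.
// Deliberately does NOT consult MEMORIX_LLM_API_KEY, which may belong to
// a non-Whisper provider such as Anthropic.
function resolveWhisperConfig(env: Record<string, string | undefined>): {
  apiKey: string;
  baseUrl: string;
} {
  const apiKey = env.MEMORIX_WHISPER_API_KEY ?? env.OPENAI_API_KEY;
  if (!apiKey) {
    throw new Error(
      "Audio ingestion requires MEMORIX_WHISPER_API_KEY or OPENAI_API_KEY",
    );
  }
  // Optional override for self-hosted / Whisper-compatible endpoints.
  const baseUrl = env.MEMORIX_WHISPER_BASE_URL ?? "https://api.openai.com/v1";
  return { apiKey, baseUrl };
}
```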
Owner

@AVIDS2 left a comment


The test/type issues look improved, but there is still a concrete config/runtime mismatch before this is safe to merge: the PR advertises provider: groq, yet the Whisper credential resolver only falls back to MEMORIX_WHISPER_API_KEY or OPENAI_API_KEY. In the normal provider=groq case, a user with only GROQ_API_KEY configured will still fail, and a user with only OPENAI_API_KEY configured will send the wrong credential to the Groq endpoint. Please either add provider-correct key resolution for Groq, or narrow the feature so it no longer claims Groq preset support until that path is actually wired correctly.
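
A provider-correct resolution might look like the following sketch; the provider names and env-var mapping are assumptions about this repo, not its confirmed config surface:

```typescript
// Sketch: resolve the Whisper credential per provider, so a provider=groq
// setup uses GROQ_API_KEY and never silently sends an OpenAI credential
// to the Groq endpoint. MEMORIX_WHISPER_API_KEY still takes precedence.
function resolveKeyForProvider(
  provider: "openai" | "groq",
  env: Record<string, string | undefined>,
): string {
  const dedicated = env.MEMORIX_WHISPER_API_KEY;
  if (dedicated) return dedicated;
  const fallback = provider === "groq" ? env.GROQ_API_KEY : env.OPENAI_API_KEY;
  if (!fallback) {
    throw new Error(
      `No key for provider "${provider}": set MEMORIX_WHISPER_API_KEY or ` +
        (provider === "groq" ? "GROQ_API_KEY" : "OPENAI_API_KEY"),
    );
  }
  return fallback;
}
```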
