feat: add multi-provider audio transcription support by esafwan · Pull Request #134 · tridz-dev/huf

esafwan · 2026-02-07T16:31:42Z

Summary

This PR adds multi-provider audio transcription support to HUF using LiteLLM, following the same pattern as image generation. Agents can now transcribe audio files using OpenAI, Groq, Deepgram, Azure, and other providers through a unified tool interface.

Features

Core Functionality

Multi-Provider Support: OpenAI (whisper-1), Groq (whisper-large-v3), Deepgram (nova-2), Azure, Vertex AI
Flexible File Input: Accepts both Frappe File IDs and file URLs/paths
Auto-Model Detection: Automatically selects appropriate transcription model based on provider
Language Support: Optional language hint (ISO 639-1 format) or auto-detection
Real-time Updates: WebSocket events for live transcription results in chat UI

Technical Implementation

Handler Function: handle_transcribe_audio() in sdk_tools.py
Tool Registration: create_transcribe_audio_tool() with auto-update support
File Handling: Robust lookup by file_id, file_url, or file_name
Response Processing: Creates Agent Message with transcribed text
Error Handling: Comprehensive validation and error messages

Implementation Details

Tool Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `file_id` | string | No* | Frappe File document ID (preferred method) |
| `file_url` | string | No* | File URL/path (alternative: `/files/audio.mp3`) |
| `language` | string | No | ISO 639-1 language code (`en`, `es`, `fr`, etc.) |
| `model` | string | No | Model override (defaults by provider) |

*At least one of file_id or file_url is required
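
The parameter rules in the table can be sketched as a small validator. This is a minimal sketch, not the PR's code: `validate_transcribe_args` is a hypothetical helper, and the two-letter test is only a shape check for ISO 639-1 codes, not a lookup against the standard's code list.

```python
import re
from typing import Optional

# Shape check only: ISO 639-1 codes are two letters. This does not verify
# that the code is actually assigned in the standard.
_LANG_RE = re.compile(r"^[a-z]{2}$")


def validate_transcribe_args(
    file_id: Optional[str] = None,
    file_url: Optional[str] = None,
    language: Optional[str] = None,
    model: Optional[str] = None,
) -> dict:
    """Hypothetical helper: normalize and validate the tool parameters."""
    if not file_id and not file_url:
        raise ValueError("At least one of file_id or file_url is required")

    if language is not None:
        language = language.strip().lower()
        if not _LANG_RE.match(language):
            raise ValueError(f"Expected a two-letter language code, got {language!r}")

    # Empty strings are treated the same as missing values.
    return {
        "file_id": file_id or None,
        "file_url": file_url or None,
        "language": language,
        "model": model or None,
    }
```

Rejecting a malformed language code before the provider call keeps the error message local to the tool instead of surfacing a provider-specific failure.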

Default Models by Provider

{
    "openai": "whisper-1",
    "azure": "whisper-1", 
    "groq": "groq/whisper-large-v3",
    "deepgram": "deepgram/nova-2",
    "default": "whisper-1"
}
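
A minimal sketch of how the defaults above could drive auto-detection, assuming an explicit `model` always takes precedence. The helper name `select_transcription_model` is hypothetical and is not taken from the PR.

```python
from typing import Optional

# Defaults copied from the mapping above.
DEFAULT_TRANSCRIPTION_MODELS = {
    "openai": "whisper-1",
    "azure": "whisper-1",
    "groq": "groq/whisper-large-v3",
    "deepgram": "deepgram/nova-2",
    "default": "whisper-1",
}


def select_transcription_model(provider: Optional[str], model: Optional[str] = None) -> str:
    """Hypothetical helper: an explicit model wins, else the provider default."""
    if model:
        return model
    key = (provider or "").strip().lower()
    # Unknown or missing providers fall back to the "default" entry.
    return DEFAULT_TRANSCRIPTION_MODELS.get(key, DEFAULT_TRANSCRIPTION_MODELS["default"])
```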

Request Flow

1. File Lookup: Resolve file by ID, URL, or name
2. Model Selection: Use explicit model or auto-detect by provider
3. API Call: litellm.transcription() with file path
4. Response Processing: Extract transcribed text
5. Message Creation: Store as Agent Message with role="agent"
6. WebSocket Event: Emit new_agent_message for real-time UI update
7. Return Metadata: Success status, text, file info, message ID
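
The API call step in the flow above can be sketched as follows. This is an illustration under stated assumptions, not the PR's handler: `transcribe_file` is a hypothetical helper, and the transcription function is injected so the sketch runs without network access. In this PR's setting the injected callable would be `litellm.transcription`, which takes `model` and `file` and accepts an optional `language`.

```python
from typing import Any, Callable, Optional


def transcribe_file(
    path: str,
    model: str,
    transcribe: Callable[..., Any],
    language: Optional[str] = None,
) -> Any:
    """Hypothetical helper: open the audio file and pass it to `transcribe`."""
    kwargs = {"model": model}
    if language:
        # Only forward the hint when one was given, so the provider can
        # fall back to auto-detection otherwise.
        kwargs["language"] = language

    # Binary mode: audio is not text, and the handle is closed on exit.
    with open(path, "rb") as audio:
        return transcribe(file=audio, **kwargs)
```

Injecting the callable also makes the handler easy to exercise in tests with a stub instead of a live provider.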

Response Format

{
  "success": true,
  "text": "Transcribed audio content...",
  "file_id": "abc123",
  "file_url": "/files/audio.mp3",
  "language": "en",
  "model": "whisper-1",
  "message_id": "msg-xyz",
  "conversation_id": "conv-123"
}
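
For callers that want a typed view of this payload, the fields above could be described roughly as below. The `TranscriptionResult` type and the `build_result` helper are hypothetical names for illustration; only the field names come from the response shown above, and which fields may be null is an assumption.

```python
from typing import Optional, TypedDict


class TranscriptionResult(TypedDict):
    success: bool
    text: str
    file_id: Optional[str]
    file_url: Optional[str]
    language: Optional[str]
    model: str
    message_id: Optional[str]
    conversation_id: Optional[str]


def build_result(
    text: str,
    model: str,
    *,
    file_id: Optional[str] = None,
    file_url: Optional[str] = None,
    language: Optional[str] = None,
    message_id: Optional[str] = None,
    conversation_id: Optional[str] = None,
) -> TranscriptionResult:
    """Hypothetical helper: assemble the success payload in a fixed key order."""
    return TranscriptionResult(
        success=True,
        text=text,
        file_id=file_id,
        file_url=file_url,
        language=language,
        model=model,
        message_id=message_id,
        conversation_id=conversation_id,
    )
```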

Usage Examples

Example 1: Basic Transcription

transcribe_audio(
    file_id="file-abc123"
)

Example 2: With Language Hint

transcribe_audio(
    file_url="/files/spanish_audio.mp3",
    language="es"
)

Example 3: With Model Override

transcribe_audio(
    file_id="file-xyz789",
    model="groq/whisper-large-v3"  # Use Groq's model
)

Example 4: Agent Workflow

User: "Can you transcribe this audio file?"
Agent: [Resolves the uploaded file and gets its file_id]
Agent: [Calls transcribe_audio(file_id="file-123")]
Agent: "Here's the transcription: [text]"

Comparison with Image Generation

| Aspect | Image Generation | Audio Transcription |
| --- | --- | --- |
| Function | `litellm.image_generation()` | `litellm.transcription()` |
| Input | Text prompt | Audio file |
| Output | Image file (PNG/JPEG) | Text string |
| Storage | `generated_image` field | Message content |
| File Handling | Download/decode → save | Read file path → transcribe |
| Message Kind | "Image" | "Text" (standard) |
| Provider Support | OpenAI, Google, Azure, etc. | OpenAI, Groq, Deepgram, etc. |

Testing

Tested Scenarios

✅ File lookup by file_id
✅ File lookup by file_url
✅ File lookup by file_name (fallback)
✅ Auto-model detection (OpenAI, Groq)
✅ Language parameter handling
✅ Model override parameter
✅ Agent Message creation
✅ WebSocket event emission
✅ Error handling (missing file, invalid API key)

Test Commands

# Install/migrate
bench --site [site] migrate

# Verify tool exists
bench --site [site] console
>>> frappe.db.exists("Agent Tool Function", "transcribe_audio")

# Test transcription
# (Upload audio file, get file_id, call tool via agent)

Migration Notes

For Existing Installations:

# Pull latest code
cd apps/huf && git pull

# Run migration (auto-creates tool)
bench --site [site] migrate

# Restart server
bench restart

For New Installations:
The tool is created automatically during bench install-app huf.

Checklist

Add handle_transcribe_audio() function using LiteLLM's transcription() API for multi-provider audio transcription support.

Features:
- Supports OpenAI (whisper-1), Groq (whisper-large-v3), Deepgram (nova-2)
- Auto-detects model based on provider if not specified
- Handles both file_id and file_url input methods
- Optional language parameter for better accuracy
- Creates Agent Message with transcribed text
- Emits WebSocket events for real-time updates
- Follows same pattern as image generation handler

Technical Details:
- Uses litellm.transcription() for provider abstraction
- Normalizes model names via _normalize_model_name()
- Handles file reading from Frappe File documents
- Creates conversation messages with proper indexing
- Returns comprehensive metadata (text, file_id, message_id, language, model)

This enables agents to transcribe audio files using any LiteLLM-supported provider, starting with OpenAI support.

Register transcribe_audio tool in Agent Tool Function system with proper parameters and tool type.

Changes:
- Create create_transcribe_audio_tool() function
- Handles both creation and updates (no patch needed)
- Creates "Transcription" tool type if not exists
- Registers tool with parameters: file_id, file_url, language, model
- Updates after_install() and after_migrate() hooks

Tool Parameters:
- file_id: File document ID (preferred)
- file_url: File URL/path (alternative)
- language: Optional ISO 639-1 language code
- model: Optional model override (defaults by provider)

Tool Type: Transcription
Function Path: huf.ai.sdk_tools.handle_transcribe_audio

The tool is automatically available to agents after installation/migration.

Improve file URL handling to support multiple lookup methods:
- Try file_url lookup first
- Fallback to file_name lookup if file_url fails
- Better error handling for file not found cases

This ensures audio files can be found whether referenced by file_id, file_url, or file_name.
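
The lookup order described in this commit can be sketched against an in-memory list of records. This is an illustration only: `find_file` is a hypothetical helper, the record keys `name`, `file_url`, and `file_name` are assumed to mirror the fields of a Frappe File document, and deriving the file name from the last path segment of the URL is an assumption about how the fallback works.

```python
from typing import Iterable, Optional


def find_file(
    files: Iterable[dict],
    file_id: Optional[str] = None,
    file_url: Optional[str] = None,
) -> dict:
    """Hypothetical helper: resolve a file by ID, then URL, then file name."""
    records = list(files)

    if file_id:
        for record in records:
            if record["name"] == file_id:
                return record

    if file_url:
        for record in records:
            if record["file_url"] == file_url:
                return record
        # Assumed fallback: match the last path segment against file_name.
        file_name = file_url.rsplit("/", 1)[-1]
        for record in records:
            if record["file_name"] == file_name:
                return record

    raise FileNotFoundError(f"No file found for file_id={file_id!r}, file_url={file_url!r}")
```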

…tion

LiteLLM transcription() accepts file path (string) or file-like object, not raw bytes. Update implementation to pass file path directly.

Benefits:
- More efficient (no need to read entire file into memory)
- Matches LiteLLM API expectations
- Supports both file path and file-like object formats

This follows LiteLLM's recommended usage pattern for audio transcription.

- Switched from legacy 'transcription_handler' to 'sdk_tools.handle_transcribe_audio'.
- Implemented async execution for the new SDK call within the sync Frappe context.
- Explicitly set 'kind=Audio' for new voice messages.
- Added logic to store the direct File URL in the 'voice_message' field instead of just the ID.
- Included fallback mechanism to fetch 'file_url' from DB if not immediately available.
- Restored 'provider' variable definition to fix NameError.
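
One common way to run an async call from synchronous code, as the second bullet above describes, is sketched below. The commit does not say which mechanism it uses, so this is an assumption for illustration; `run_sync` is a hypothetical helper name.

```python
import asyncio
import concurrent.futures
from typing import Any, Awaitable, Callable


def run_sync(coro_factory: Callable[[], Awaitable[Any]]) -> Any:
    """Hypothetical helper: run an async callable to completion from sync code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No event loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro_factory())

    # A loop is already running in this thread. asyncio.run would raise here,
    # so run the coroutine in a worker thread that gets its own loop.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(lambda: asyncio.run(coro_factory())).result()
```

The second branch blocks the calling thread until the worker finishes, which is acceptable for a synchronous request handler but would stall a busy event loop.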

- Updated 'handle_transcribe_audio' to open files in binary mode ('rb') for better LiteLLM compatibility.
- Implemented upsert logic: Updates the existing message if a 'message_id' is provided, preventing duplicates.
- Corrected Agent Message creation to use valid Kind='Audio' instead of 'Text'.
- Enhanced real-time socket event payload to include 'file_url' and signal 'update_message'.
- Ensured file attachments are correctly linked to the updated message document.
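
The upsert behavior in the second bullet above can be sketched with a plain dictionary standing in for the message table. Everything here is hypothetical (the `upsert_message` helper, the `msg-` ID format, and the stored keys); the real handler works on Agent Message documents.

```python
import uuid
from typing import Optional, Tuple


def upsert_message(
    store: dict,
    text: str,
    message_id: Optional[str] = None,
    file_url: Optional[str] = None,
) -> Tuple[str, bool]:
    """Hypothetical helper: update an existing message or create a new one.

    Returns (message_id, created).
    """
    if message_id is not None and message_id in store:
        message = store[message_id]
        message["content"] = text
        if file_url is not None:
            # Keep the existing attachment unless a new URL is supplied.
            message["file_url"] = file_url
        return message_id, False

    new_id = f"msg-{uuid.uuid4().hex[:8]}"
    store[new_id] = {"content": text, "kind": "Audio", "file_url": file_url}
    return new_id, True
```

Returning the `created` flag lets the caller decide whether the UI should append a new bubble or refresh an existing one.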

- Passed 'message_id=msg.name' when calling 'sdk_tools.handle_transcribe_audio'.
- This ensures the transcription result updates the existing 'Voice Message' bubble instead of creating a duplicate message.

- Update 'handle_transcribe_audio' to robustly handle different response types from LiteLLM.
- Add check for 'text' attribute (object access) before falling back to dictionary lookup.
- Ensure compatibility with both Pydantic models/objects and dictionary responses.
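
The attribute-then-dictionary check described in this commit could look roughly like the sketch below. `extract_text` is a hypothetical helper name, and raising `ValueError` when no text is found is an assumption about the error behavior.

```python
from typing import Any


def extract_text(response: Any) -> str:
    """Hypothetical helper: read `text` from an object or a dict response."""
    # Object access first (for example a Pydantic model with a .text field).
    text = getattr(response, "text", None)
    # Fall back to dictionary lookup for plain dict responses.
    if text is None and isinstance(response, dict):
        text = response.get("text")
    if text is None:
        raise ValueError("Transcription response has no 'text' field")
    return text
```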

…tellm feat: add multi-provider audio transcription support

esafwan added 4 commits February 7, 2026 16:35

esafwan force-pushed the feature/audio-transcription-litellm branch from 862be09 to b0a214f Compare February 7, 2026 16:35

esafwan changed the title ~~Feature/audio transcription litellm~~ feat: audio transcription litellm Feb 7, 2026

esafwan changed the title ~~feat: audio transcription litellm~~ feat: add multi-provider audio transcription support Feb 7, 2026

esafwan force-pushed the feature/audio-transcription-litellm branch from b0a214f to b47fa63 Compare February 7, 2026 17:42

Sanjusha-tridz added 10 commits February 9, 2026 12:18

Merge branch 'develop' into feature/audio-transcription-litellm

b3af644

fix: merge to develop

1be4550

fix: fix: resloved conflicts after merging to develop branch

053dacb

fix: change call id type to small text in Agent Tool Call

586eecd

Merge branch 'develop' into feature/audio-transcription-litellm

e9251aa

update: added field to link voice input

0d6c63a

fix: Pass message_id to transcription handler to enable upsert

63ba61c

- Passed 'message_id=msg.name' when calling 'sdk_tools.handle_transcribe_audio'.
- This ensures the transcription result updates the existing 'Voice Message' bubble instead of creating a duplicate message.

Sanjusha-tridz marked this pull request as ready for review February 18, 2026 11:12

Sanjusha-tridz merged commit 8105bdb into develop Feb 18, 2026
1 of 3 checks passed

Sanjusha-tridz deleted the feature/audio-transcription-litellm branch February 18, 2026 11:14

esafwan pushed a commit that referenced this pull request Mar 10, 2026

Merge pull request #134 from tridz-dev/feature/audio-transcription-li…

9aff3ef

…tellm feat: add multi-provider audio transcription support

Sanjusha-tridz merged 14 commits into develop from feature/audio-transcription-litellm
