feat: add multi-provider audio transcription support #134
Merged
Add handle_transcribe_audio() function using LiteLLM's transcription() API for multi-provider audio transcription support.

Features:
- Supports OpenAI (whisper-1), Groq (whisper-large-v3), Deepgram (nova-2)
- Auto-detects model based on provider if not specified
- Handles both file_id and file_url input methods
- Optional language parameter for better accuracy
- Creates Agent Message with transcribed text
- Emits WebSocket events for real-time updates
- Follows same pattern as image generation handler

Technical Details:
- Uses litellm.transcription() for provider abstraction
- Normalizes model names via _normalize_model_name()
- Handles file reading from Frappe File documents
- Creates conversation messages with proper indexing
- Returns comprehensive metadata (text, file_id, message_id, language, model)

This enables agents to transcribe audio files using any LiteLLM-supported provider, starting with OpenAI support.
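The "auto-detects model based on provider" behaviour can be sketched roughly as below. This is a minimal illustration, not the PR's code: the function name resolve_transcription_model is hypothetical, the mapping is copied from the "Default Models by Provider" section of the PR description, and the real handler also runs names through _normalize_model_name(), which is not shown here.

```python
from typing import Optional

# Defaults copied from the "Default Models by Provider" section of this PR.
DEFAULT_TRANSCRIPTION_MODELS = {
    "openai": "whisper-1",
    "azure": "whisper-1",
    "groq": "groq/whisper-large-v3",
    "deepgram": "deepgram/nova-2",
    "default": "whisper-1",
}


def resolve_transcription_model(provider: Optional[str], model: Optional[str] = None) -> str:
    """Hypothetical helper: prefer an explicit model, else the provider default."""
    if model:
        # An explicit override always wins.
        return model
    key = (provider or "").strip().lower()
    # Unknown or missing providers fall back to the "default" entry.
    return DEFAULT_TRANSCRIPTION_MODELS.get(key, DEFAULT_TRANSCRIPTION_MODELS["default"])
```

With this shape, an explicit model parameter on the tool call overrides the provider default, which matches the "model override" parameter described in the PR.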
Register transcribe_audio tool in Agent Tool Function system with proper parameters and tool type.

Changes:
- Create create_transcribe_audio_tool() function
- Handles both creation and updates (no patch needed)
- Creates "Transcription" tool type if not exists
- Registers tool with parameters: file_id, file_url, language, model
- Updates after_install() and after_migrate() hooks

Tool Parameters:
- file_id: File document ID (preferred)
- file_url: File URL/path (alternative)
- language: Optional ISO 639-1 language code
- model: Optional model override (defaults by provider)

Tool Type: Transcription
Function Path: huf.ai.sdk_tools.handle_transcribe_audio

The tool is automatically available to agents after installation/migration.
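For illustration, the parameter contract above (file_id preferred, file_url as alternative, at least one of them required, language as an ISO 639-1 code) could be checked along these lines. The function validate_transcribe_args and its error messages are assumptions made for this sketch; the PR does not show how the handler validates its input.

```python
import re
from typing import Optional


def validate_transcribe_args(
    file_id: Optional[str] = None,
    file_url: Optional[str] = None,
    language: Optional[str] = None,
    model: Optional[str] = None,
) -> dict:
    """Hypothetical validator for the transcribe_audio tool parameters."""
    if not file_id and not file_url:
        raise ValueError("Either file_id or file_url is required")
    if language is not None:
        language = language.strip().lower()
        # ISO 639-1 codes are exactly two letters (en, es, fr, ...).
        if not re.fullmatch(r"[a-z]{2}", language):
            raise ValueError(f"Invalid ISO 639-1 language code: {language!r}")
    args = {"file_id": file_id, "file_url": file_url, "language": language, "model": model}
    # Drop unset parameters so provider-based defaults can apply later.
    return {key: value for key, value in args.items() if value is not None}
```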
Improve file URL handling to support multiple lookup methods:
- Try file_url lookup first
- Fallback to file_name lookup if file_url fails
- Better error handling for file not found cases

This ensures audio files can be found whether referenced by file_id, file_url, or file_name.
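The lookup order described above can be sketched as follows. The helper find_file_name is hypothetical, and the get_value callable stands in for whatever database lookup the handler really uses; it is injected so the sketch runs without a Frappe site.

```python
from typing import Callable, Optional


def find_file_name(file_ref: str, get_value: Callable[[dict], Optional[str]]) -> str:
    """Hypothetical lookup: try file_url first, then fall back to file_name."""
    for field in ("file_url", "file_name"):
        name = get_value({field: file_ref})
        if name:
            return name
    # Neither lookup matched, so report a clear "not found" error.
    raise FileNotFoundError(f"No File document matches {file_ref!r}")
```

Inside Frappe, one plausible way to supply get_value (an assumption, not taken from the PR) is `lambda filters: frappe.db.get_value("File", filters, "name")`.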
…tion

LiteLLM transcription() accepts file path (string) or file-like object, not raw bytes. Update implementation to pass file path directly.

Benefits:
- More efficient (no need to read entire file into memory)
- Matches LiteLLM API expectations
- Supports both file path and file-like object formats

This follows LiteLLM's recommended usage pattern for audio transcription.
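A rough sketch of the file-like-object form of the call, assuming litellm.transcription() is invoked with model, file and an optional language keyword. The wrapper transcribe_file is hypothetical; it opens the file in binary mode ('rb'), which is what a later commit in this PR settles on, and takes the transcription callable as an argument so it can be exercised without network access.

```python
from typing import Any, Callable, Optional


def transcribe_file(
    path: str,
    model: str = "whisper-1",
    language: Optional[str] = None,
    transcribe: Optional[Callable[..., Any]] = None,
) -> Any:
    """Hypothetical wrapper: pass an open binary file, not raw bytes."""
    if transcribe is None:
        # Imported lazily so the sketch can be tested without LiteLLM installed.
        import litellm

        transcribe = litellm.transcription
    kwargs = {"model": model}
    if language:
        # Only forward the language hint when the caller supplied one.
        kwargs["language"] = language
    with open(path, "rb") as audio_file:
        return transcribe(file=audio_file, **kwargs)
```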
862be09 to b0a214f
b0a214f to b47fa63
- Switched from legacy 'transcription_handler' to 'sdk_tools.handle_transcribe_audio'.
- Implemented async execution for the new SDK call within the sync Frappe context.
- Explicitly set 'kind=Audio' for new voice messages.
- Added logic to store the direct File URL in the 'voice_message' field instead of just the ID.
- Included fallback mechanism to fetch 'file_url' from DB if not immediately available.
- Restored 'provider' variable definition to fix NameError.
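The commit above does not show how the async SDK call is driven from synchronous code. One common approach, sketched here under that caveat, is a small bridge that uses asyncio.run() when no event loop is active and falls back to a helper thread otherwise. The name run_coroutine_sync is hypothetical.

```python
import asyncio
import threading
from typing import Any, Coroutine


def run_coroutine_sync(coro: Coroutine[Any, Any, Any]) -> Any:
    """Hypothetical bridge: run a coroutine to completion from sync code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() is safe.
        return asyncio.run(coro)

    # A loop is already running here; run the coroutine in a separate thread
    # with its own event loop and wait for the result.
    outcome = {}

    def _runner() -> None:
        try:
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:  # re-raised in the calling thread below
            outcome["error"] = exc

    worker = threading.Thread(target=_runner)
    worker.start()
    worker.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```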
- Updated 'handle_transcribe_audio' to open files in binary mode ('rb') for better LiteLLM compatibility.
- Implemented upsert logic: Updates the existing message if a 'message_id' is provided, preventing duplicates.
- Corrected Agent Message creation to use valid Kind='Audio' instead of 'Text'.
- Enhanced real-time socket event payload to include 'file_url' and signal 'update_message'.
- Ensured file attachments are correctly linked to the updated message document.
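The upsert behaviour in this commit can be illustrated with an in-memory stand-in for Agent Message documents. Everything here is a sketch: upsert_message and the dict-based store are hypothetical, the field names are illustrative, and the returned action labels borrow the 'new_agent_message' and 'update_message' names mentioned in this PR without claiming that is how the real socket payload is structured.

```python
import uuid
from typing import Optional, Tuple


def upsert_message(
    store: dict,
    text: str,
    file_url: Optional[str] = None,
    message_id: Optional[str] = None,
) -> Tuple[str, str]:
    """Hypothetical upsert: update the message if it exists, else create one."""
    fields = {"content": text, "kind": "Audio", "file_url": file_url}
    if message_id and message_id in store:
        # Existing voice message: update in place to avoid a duplicate bubble.
        store[message_id].update(fields)
        return message_id, "update_message"
    # No usable message_id: create a fresh message with a generated ID.
    new_id = uuid.uuid4().hex
    store[new_id] = fields
    return new_id, "new_agent_message"
```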
- Passed 'message_id=msg.name' when calling 'sdk_tools.handle_transcribe_audio'.
- This ensures the transcription result updates the existing 'Voice Message' bubble instead of creating a duplicate message.
- Update 'handle_transcribe_audio' to robustly handle different response types from LiteLLM.
- Add check for 'text' attribute (object access) before falling back to dictionary lookup.
- Ensure compatibility with both Pydantic models/objects and dictionary responses.
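The attribute-then-dictionary check described above might look like the following. This is a minimal sketch; extract_transcript_text is a hypothetical name and the error handling is an assumption.

```python
from typing import Any


def extract_transcript_text(response: Any) -> str:
    """Hypothetical helper: read 'text' from an object or a dict response."""
    # Object-style access first (e.g. a Pydantic model with a .text field).
    text = getattr(response, "text", None)
    if text is None and isinstance(response, dict):
        # Fall back to dictionary lookup for plain dict responses.
        text = response.get("text")
    if text is None:
        raise ValueError("Transcription response has no 'text' field")
    return text
```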
esafwan pushed a commit that referenced this pull request Mar 10, 2026
…tellm feat: add multi-provider audio transcription support
Summary
This PR adds multi-provider audio transcription support to HUF using LiteLLM, following the same pattern as image generation. Agents can now transcribe audio files using OpenAI, Groq, Deepgram, Azure, and other providers through a unified tool interface.
Features
Core Functionality
Technical Implementation
- handle_transcribe_audio() in sdk_tools.py
- create_transcribe_audio_tool() with auto-update support

Implementation Details
Tool Parameters
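- file_id: File document ID (preferred)
- file_url: File URL/path (alternative, e.g. /files/audio.mp3)
- language: Optional ISO 639-1 language code (en, es, fr, etc.)
- model: Optional model override (defaults by provider)

*At least one of file_id or file_url is required.

Default Models by Provider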
{
  "openai": "whisper-1",
  "azure": "whisper-1",
  "groq": "groq/whisper-large-v3",
  "deepgram": "deepgram/nova-2",
  "default": "whisper-1"
}

Request Flow
- litellm.transcription() with file path
- new_agent_message for real-time UI update

Response Format
{
  "success": true,
  "text": "Transcribed audio content...",
  "file_id": "abc123",
  "file_url": "/files/audio.mp3",
  "language": "en",
  "model": "whisper-1",
  "message_id": "msg-xyz",
  "conversation_id": "conv-123"
}

Usage Examples
Example 1: Basic Transcription
Example 2: With Language Hint
Example 3: With Model Override
Example 4: Agent Workflow
Comparison with Image Generation
- litellm.image_generation() vs. litellm.transcription()
- generated_image field

Testing
Tested Scenarios
- file_id
- file_url
- file_name (fallback)

Test Commands
Migration Notes
For Existing Installations:
For New Installations:
Tool is automatically created during bench install-app huf.

Checklist