# Volcengine ASR Stage-1 Replacement Design

**Date:** 2026-03-20

**Goal:** Replace the current Stage 1 `transcription` agent with Volcengine's flash ASR API using `audio.data` base64 input, so that original dialogue recognition is handled by dedicated ASR instead of a general multimodal model.

## Problem

The current Stage 1 pipeline uses a general model to transcribe dialogue from uploaded media. Even after changing the request to audio-only input, recognition quality is still limited by a model whose primary job is not ASR. When Stage 1 drifts from the real dialogue, the downstream translation and TTS stages faithfully amplify the mistake.

We need a Stage 1 that is optimized for:

- faithful speech recognition
- utterance-level timestamps
- speaker separation
- stable, repeatable audio-only behavior

## Recommendation

Adopt Volcengine's flash ASR API for Stage 1 only and keep the rest of the `4+1` pipeline unchanged:

1. `Transcription` -> Volcengine ASR
2. `Segmentation` -> existing local segmenter
3. `Translation` -> existing LLM translation stage
4. `Voice Matching` -> existing matcher
5. `Validation` -> existing validator

This gives us a better ASR foundation without rewriting the rest of the subtitle stack.

## Why This Fits

- The API is purpose-built for recorded audio recognition.
- It returns utterance-level timing data that maps well to our subtitle model.
- It supports speaker info and gender detection, which match our Stage 1 output shape.
- It accepts `audio.data` as base64, which avoids temporary public URL hosting.
- It returns results in a single request, so Stage 1 becomes simpler than the standard submit/query flow.

## Scope

### In Scope

- Replace the Stage 1 transcription provider for the Doubao/Volcengine path with the flash ASR API.
- Extract audio from uploaded or temp video before Stage 1.
- Send extracted audio as `audio.data` base64 in a single ASR request.
- Map ASR `utterances` into internal `TranscriptSegment[]`.
- Preserve existing downstream stages and the existing editor UI contract.
- Add detailed Stage 1 logs for the ASR request, result, and mapping steps.

### Out of Scope

- Replacing translation, TTS, or voice matching providers.
- Changing the editor data contract beyond Stage 1 diagnostics.
- Reworking trim/export behavior outside Stage 1 input preparation.
- Automatic fallback to another ASR vendor in this first iteration.

## Current State

Relevant files:

- [transcriptionStage.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleStages/transcriptionStage.ts)
- [multiStageSubtitleGeneration.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/multiStageSubtitleGeneration.ts)
- [subtitleGeneration.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleGeneration.ts)
- [subtitleService.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/services/subtitleService.ts)
- [stageTypes.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleStages/stageTypes.ts)

Current Stage 1 responsibilities:

- extract audio locally when using Doubao
- call a model endpoint directly
- parse model JSON into `TranscriptSegment[]`

This means recognition logic and model prompt logic are still coupled in one module.

## Target Architecture

### Stage 1 Flow

1. Receive the video file on the server.
2. Extract normalized WAV audio using `ffmpeg`.
3. Base64-encode the WAV audio.
4. Send the audio to `recognize/flash`.
5. Map `result.utterances` into `TranscriptSegment[]`.
6. Pass those segments to the existing `segmentation` stage.
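Steps 3 and 4 of the flow above could be sketched as a pure request-body builder. The field names (`audio.data`, `show_utterances`, `enable_speaker_info`, `enable_gender_detection`) follow the options discussed in this document, but `buildFlashAsrRequestBody` is a hypothetical helper and the exact wire format must be verified against the Volcengine flash ASR reference before implementation:

```typescript
// Hypothetical request-body builder for the flash ASR call.
// Field names mirror the options listed in this doc; verify the exact
// schema against the Volcengine flash ASR documentation.
interface FlashAsrRequestOptions {
  audioBase64: string; // base64-encoded WAV from step 3
  language?: string;   // optional language hint
}

function buildFlashAsrRequestBody(opts: FlashAsrRequestOptions) {
  return {
    audio: { data: opts.audioBase64 },
    request: {
      show_utterances: true,
      enable_speaker_info: true,
      enable_gender_detection: true,
      // only include the hint when configured
      ...(opts.language ? { language: opts.language } : {}),
    },
  };
}
```

Keeping this builder pure (no network calls) makes the "request body matches expected schema" unit test in the Testing Strategy trivial to write.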
### Proposed Modules

- `src/server/subtitleStages/transcriptionStage.ts`
  - keep the stage entrypoint
  - delegate provider-specific ASR work to a helper
- `src/server/volcengineAsr.ts`
  - send the flash recognition request
  - parse and normalize API payloads

## Input and Output Mapping

### ASR Input

Stage 1 should send:

- extracted WAV audio
- a language hint when configured
- `show_utterances=true`
- `enable_speaker_info=true`
- `enable_gender_detection=true`
- conservative transcription options

We should avoid options that aggressively rewrite spoken language for readability in Stage 1.

### ASR Output -> Internal Mapping

From ASR:

- `utterances[].text` -> `originalText`
- `utterances[].start_time` -> `startTime`
- `utterances[].end_time` -> `endTime`
- speaker info -> `speaker`, `speakerId`
- gender info -> `gender`

Internal notes:

- If numeric confidence is unavailable, set `confidence` to a safe default and rely on `needsReview` heuristics from other signals.
- If gender is missing, normalize to `unknown`.
- If utterances are missing, fall back to a single full-text segment only as a last resort.

## Temporary Audio URL Strategy

The flash ASR API supports `audio.data`, so Stage 1 can avoid temporary public URLs entirely.

Recommended implementation:

- extract WAV to a temp file
- read the file into base64
- send the base64 string in `audio.data`
- delete the temp file immediately after encoding

This avoids network reachability issues and is the simplest option for local and server environments.

## Logging

Add explicit Stage 1 ASR logs:

- request started
- request finished
- result mapping summary
- cleanup status

Log fields should include:

- `requestId`
- `provider`
- `resourceId`
- API endpoint
- temp audio path presence
- base64 audio size
- request duration
- utterance count

We should continue logging summaries by default, not full raw transcripts, unless the user explicitly asks for full payload logging.
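The "ASR Output -> Internal Mapping" rules above could be sketched as a small pure function. The `TranscriptSegment` field names come from the mapping table; the raw `speaker_id`/`gender` field names and the `0.8` default confidence are assumptions to be confirmed against real API responses:

```typescript
// Assumed shape of a raw flash ASR utterance. `text`, `start_time`, and
// `end_time` are named in the mapping above; the speaker/gender field
// names are assumptions pending a real response sample.
interface AsrUtterance {
  text: string;
  start_time: number; // milliseconds (assumed)
  end_time: number;
  speaker_id?: string;
  gender?: string;
}

interface TranscriptSegment {
  originalText: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  gender: string;
  confidence: number;
}

// Safe default when the API returns no numeric confidence (see Risks:
// we should not invent precise per-utterance confidence values).
const DEFAULT_CONFIDENCE = 0.8;

function mapUtterances(utterances: AsrUtterance[]): TranscriptSegment[] {
  return utterances.map((u) => ({
    originalText: u.text.trim(),
    startTime: u.start_time,
    endTime: u.end_time,
    speakerId: u.speaker_id ?? "unknown",
    gender: u.gender ?? "unknown", // normalize missing gender to "unknown"
    confidence: DEFAULT_CONFIDENCE,
  }));
}
```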
## Error Handling

Stage 1 should fail clearly for:

- ffmpeg extraction failure
- ASR request rejection
- ASR service busy response
- malformed ASR result payload

Preferred behavior:

- fail Stage 1
- mark the subtitle job as failed
- preserve enough context in logs to identify whether the failure happened during extraction, the ASR request, or result mapping

## Configuration

Add environment-driven config for the ASR API:

- app key
- access key
- resource id
- flash URL
- request timeout
- optional language hint

These should live separately from the existing Doubao LLM config because this is no longer the same provider call shape.

## Testing Strategy

### Unit Tests

- audio extraction helper returns WAV base64
- ASR request body matches the expected schema
- ASR response parsing handles success and API error codes
- ASR result mapping produces valid `TranscriptSegment[]`

### Integration-Level Tests

- `transcriptionStage` uses the ASR helper when configured
- `multiStageSubtitleGeneration` still receives valid transcript output
- `subtitleService` keeps the frontend contract unchanged

### Regression Focus

- no change to translation or voice matching contracts
- no regression in subtitle job stages and progress messages
- no regression in the editor auto-generation flow

## Risks

### 1. Confidence mismatch

If the ASR result does not provide a numeric confidence comparable to the current stage contract, we need a fallback policy. We should not invent precise confidence values.

### 2. Speaker metadata variance

Speaker labels and gender fields may differ from the current Stage 1 output. Downstream code already tolerates `unknown`, but we should normalize carefully.

## Rollout Plan

1. Implement the flash ASR client and mapping behind Stage 1.
2. Keep the old transcription path available behind a feature flag or fallback branch during the transition.
3. Validate with the known failing sample video.
4. Remove the old direct multimodal transcription path once logs and results are stable.
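The environment-driven configuration listed in the Configuration section could be loaded with a sketch like the following. The `VOLC_ASR_*` variable names and the 30-second default timeout are placeholders, not established project conventions; the flash URL is intentionally required from the environment rather than hard-coded:

```typescript
// Hypothetical env-driven config loader for the flash ASR client,
// kept separate from the existing Doubao LLM config as described above.
// All VOLC_ASR_* names below are placeholder conventions.
interface VolcengineAsrConfig {
  appKey: string;
  accessKey: string;
  resourceId: string;
  flashUrl: string;
  timeoutMs: number;
  languageHint?: string;
}

function loadAsrConfig(env: Record<string, string | undefined>): VolcengineAsrConfig {
  const required = (name: string): string => {
    const value = env[name];
    if (!value) throw new Error(`Missing required ASR config: ${name}`);
    return value;
  };
  return {
    appKey: required("VOLC_ASR_APP_KEY"),
    accessKey: required("VOLC_ASR_ACCESS_KEY"),
    resourceId: required("VOLC_ASR_RESOURCE_ID"),
    flashUrl: required("VOLC_ASR_FLASH_URL"),
    timeoutMs: Number(env["VOLC_ASR_TIMEOUT_MS"] ?? "30000"), // assumed default
    languageHint: env["VOLC_ASR_LANGUAGE"],
  };
}
```

Failing fast on missing keys at startup keeps configuration errors out of the per-request error-handling paths above.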
## Success Criteria

- Stage 1 no longer sends video content to a general model for transcription.
- Logs show the flash ASR request lifecycle for each transcription request.
- The known failing sample produces original dialogue closer to the ground truth than the current Stage 1.
- Downstream translation and voice stages continue to work without UI contract changes.