Volcengine ASR Stage-1 Replacement Design
Date: 2026-03-20
Goal
Replace the current Stage 1 transcription agent with Volcengine's flash ASR API, using `audio.data` base64 input, so that original-dialogue recognition relies on dedicated ASR instead of a general multimodal model.
Problem
The current Stage 1 pipeline uses a general model to transcribe dialogue from uploaded media. Even after changing the request to audio-only input, recognition quality is still limited by a model whose primary job is not ASR. When Stage 1 drifts from the real dialogue, the downstream translation and TTS stages faithfully amplify the mistake.
We need a Stage 1 that is optimized for:
- faithful speech recognition
- utterance-level timestamps
- speaker separation
- stable, repeatable audio-only behavior
Recommendation
Adopt Volcengine's flash ASR API only for Stage 1 and keep the rest of the 4+1 pipeline unchanged:
- Transcription -> Volcengine ASR
- Segmentation -> existing local segmenter
- Translation -> existing LLM translation stage
- Voice Matching -> existing matcher
- Validation -> existing validator
This gives us a better ASR foundation without rewriting the rest of the subtitle stack.
Why This Fits
- The API is purpose-built for recorded audio recognition.
- It returns utterance-level timing data that maps well to our subtitle model.
- It supports speaker info and gender detection, which match our Stage 1 output shape.
- It accepts `audio.data` as base64, which avoids temporary public URL hosting.
- It returns results in a single request, so Stage 1 becomes simpler than the standard submit/query flow.
Scope
In Scope
- Replace Stage 1 transcription provider for Doubao/Volcengine path with flash ASR API.
- Extract audio from uploaded or temp video before Stage 1.
- Send extracted audio as `audio.data` base64 in a single ASR request.
- Map ASR `utterances` into internal `TranscriptSegment[]`.
- Preserve existing downstream stages and the existing editor UI contract.
- Add detailed Stage 1 logs for the ASR request, response, and result mapping.
Out of Scope
- Replacing translation, TTS, or voice matching providers.
- Changing the editor data contract beyond Stage 1 diagnostics.
- Reworking trim/export behavior outside Stage 1 input preparation.
- Automatic fallback to another ASR vendor in this first iteration.
Current State
Relevant files:
- transcriptionStage.ts
- multiStageSubtitleGeneration.ts
- subtitleGeneration.ts
- subtitleService.ts
- stageTypes.ts
Current Stage 1 responsibilities:
- extract audio locally when using Doubao
- call a model endpoint directly
- parse model JSON into `TranscriptSegment[]`
This means recognition logic and model prompt logic are still coupled together in one module.
Target Architecture
Stage 1 Flow
- Receive video file on server.
- Extract normalized WAV audio using `ffmpeg`.
- Base64-encode the WAV audio.
- Send the audio to `recognize/flash`.
- Map `result.utterances` into `TranscriptSegment[]`.
- Pass those segments to the existing `segmentation` stage.
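The flow above can be sketched as a thin orchestration function with injected helpers, which keeps it testable without a real ffmpeg binary or network call. All names here (the `TranscriptSegment` fields, the helper signatures) are illustrative assumptions, not the real module API:

```typescript
// Sketch of the Stage 1 flow; helper implementations are injected so the
// orchestration itself stays pure and easy to unit-test.
interface TranscriptSegment {
  originalText: string;
  startTime: number; // milliseconds
  endTime: number;   // milliseconds
}

interface Stage1Deps {
  extractWavBase64: (videoPath: string) => Promise<string>;
  recognizeFlash: (audioBase64: string) => Promise<TranscriptSegment[]>;
}

async function runTranscriptionStage(
  videoPath: string,
  deps: Stage1Deps,
): Promise<TranscriptSegment[]> {
  // Steps 1-3: extract and base64-encode the WAV audio.
  const audioBase64 = await deps.extractWavBase64(videoPath);
  // Steps 4-5: call recognize/flash and map utterances into segments.
  return deps.recognizeFlash(audioBase64);
}
```

With this shape, the existing `segmentation` stage simply consumes the returned array, and unit tests can pass in fakes for both helpers.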
Proposed Modules
`src/server/subtitleStages/transcriptionStage.ts`
- keep the stage entrypoint
- delegate provider-specific ASR work to a helper

`src/server/volcengineAsr.ts`
- send the flash recognition request
- parse and normalize API payloads
Input and Output Mapping
ASR Input
Stage 1 should send:
- extracted WAV audio
- language hint when configured
- `show_utterances=true`
- `enable_speaker_info=true`
- `enable_gender_detection=true`
- conservative transcription options
We should avoid options that aggressively rewrite spoken language for readability in Stage 1.
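A request-body builder for these options might look like the sketch below. The option names are the ones listed above; the wrapper structure (`audio` / `request` objects, `format` field) is an assumption that should be confirmed against the Volcengine flash ASR API reference:

```typescript
// Sketch of the flash ASR request payload; verify the exact envelope shape
// against the official API docs before relying on it.
interface FlashAsrRequest {
  audio: { data: string; format: "wav" };
  request: {
    show_utterances: boolean;
    enable_speaker_info: boolean;
    enable_gender_detection: boolean;
    language?: string;
  };
}

function buildFlashRequest(audioBase64: string, language?: string): FlashAsrRequest {
  return {
    audio: { data: audioBase64, format: "wav" },
    request: {
      show_utterances: true,
      enable_speaker_info: true,
      enable_gender_detection: true,
      // Include the language hint only when it is configured.
      ...(language ? { language } : {}),
    },
  };
}
```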
ASR Output -> Internal Mapping
From ASR:
- `utterances[].text` -> `originalText`
- `utterances[].start_time` -> `startTime`
- `utterances[].end_time` -> `endTime`
- speaker info -> `speaker`, `speakerId`
- gender info -> `gender`
Internal notes:
- If numeric confidence is unavailable, set `confidence` to a safe default and rely on `needsReview` heuristics from other signals.
- If gender is missing, normalize to `unknown`.
- If utterances are missing, fall back to a single full-text segment only as a last resort.
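The mapping plus the three fallback rules can be captured in one pure function. The internal field names follow this document; the `0.8` default confidence is a placeholder assumption, not a value prescribed by the API or the existing contract:

```typescript
// Sketch of utterance-to-segment mapping with the fallback rules above.
interface AsrUtterance {
  text: string;
  start_time: number;
  end_time: number;
  speaker?: string;
  gender?: string;
}

interface TranscriptSegment {
  originalText: string;
  startTime: number;
  endTime: number;
  speaker: string;
  gender: string;
  confidence: number;
}

// Assumed safe default when the ASR result carries no numeric confidence.
const DEFAULT_CONFIDENCE = 0.8;

function mapUtterances(
  utterances: AsrUtterance[] | undefined,
  fullText: string,
  totalDurationMs: number,
): TranscriptSegment[] {
  if (!utterances || utterances.length === 0) {
    // Last resort: one full-text segment spanning the whole audio.
    return [{
      originalText: fullText,
      startTime: 0,
      endTime: totalDurationMs,
      speaker: "unknown",
      gender: "unknown",
      confidence: DEFAULT_CONFIDENCE,
    }];
  }
  return utterances.map((u) => ({
    originalText: u.text,
    startTime: u.start_time,
    endTime: u.end_time,
    speaker: u.speaker ?? "unknown",
    gender: u.gender ?? "unknown", // missing gender normalizes to "unknown"
    confidence: DEFAULT_CONFIDENCE,
  }));
}
```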
Temporary Audio URL Strategy
The flash ASR API supports `audio.data`, so Stage 1 can avoid temporary public URLs entirely.
Recommended implementation:
- extract WAV to a temp file
- read the file into base64
- send the base64 string in `audio.data`
- delete the temp file as soon as it has been read
This avoids network reachability issues and is the simplest option for local and server environments.
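A minimal sketch of the extraction helper, assuming Node.js with an `ffmpeg` binary on the PATH. The mono 16 kHz output shape is a common ASR input format, but the sample rate the flash API actually expects should be verified:

```typescript
import { execFile } from "node:child_process";
import { promises as fs } from "node:fs";
import * as os from "node:os";
import * as path from "node:path";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// ffmpeg arguments for a normalized mono 16 kHz WAV with video stripped.
function ffmpegArgs(inputPath: string, outputPath: string): string[] {
  return ["-y", "-i", inputPath, "-vn", "-ac", "1", "-ar", "16000", "-f", "wav", outputPath];
}

async function extractWavBase64(videoPath: string): Promise<string> {
  const wavPath = path.join(os.tmpdir(), `asr-${Date.now()}.wav`);
  try {
    await execFileAsync("ffmpeg", ffmpegArgs(videoPath, wavPath));
    const wav = await fs.readFile(wavPath);
    return wav.toString("base64");
  } finally {
    // Delete the temp file as soon as it has been read (or on failure).
    await fs.rm(wavPath, { force: true });
  }
}
```

Because the temp file lives only for the duration of one call, nothing in this flow ever needs to be network-reachable.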
Logging
Add explicit Stage 1 ASR logs:
- request started
- request finished
- result mapping summary
- cleanup status
Log fields should include:
- `requestId`
- `provider`
- `resourceId`
- API endpoint
- temp audio path presence
- base64 audio size
- request duration
- utterance count
We should continue logging summaries by default, not full raw transcripts, unless the user explicitly asks for full payload logging.
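One way to enforce "summaries by default" is to build the log entry from a fixed-shape object that has no transcript field at all. The field names below follow the list above; the `provider` literal is an assumption:

```typescript
// Sketch of a Stage 1 ASR log summary: raw transcript text is deliberately
// not representable in this shape.
interface AsrLogSummary {
  requestId: string;
  provider: "volcengine-flash-asr";
  resourceId: string;
  endpoint: string;
  hasTempAudioPath: boolean;
  audioBase64Bytes: number;
  durationMs: number;
  utteranceCount: number;
}

function summarizeAsrCall(args: {
  requestId: string;
  resourceId: string;
  endpoint: string;
  tempAudioPath?: string;
  audioBase64: string;
  durationMs: number;
  utteranceCount: number;
}): AsrLogSummary {
  return {
    requestId: args.requestId,
    provider: "volcengine-flash-asr",
    resourceId: args.resourceId,
    endpoint: args.endpoint,
    hasTempAudioPath: Boolean(args.tempAudioPath),
    audioBase64Bytes: args.audioBase64.length,
    durationMs: args.durationMs,
    utteranceCount: args.utteranceCount,
  };
}
```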
Error Handling
Stage 1 should fail clearly for:
- ffmpeg extraction failure
- ASR request rejection
- ASR service busy response
- malformed ASR result payload
Preferred behavior:
- fail Stage 1
- mark subtitle job as failed
- preserve enough context in logs to identify whether the failure happened in extraction, the ASR request, or result mapping
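Tagging every Stage 1 failure with the phase it occurred in makes that last point mechanical. A hypothetical error type (names are illustrative, not the existing codebase's):

```typescript
// Sketch of phase-tagged Stage 1 failures so logs can pinpoint where the
// pipeline died: extraction, the ASR request, or result mapping.
type Stage1Phase = "extraction" | "request" | "mapping";

class Stage1Error extends Error {
  constructor(
    public readonly phase: Stage1Phase,
    message: string,
  ) {
    super(`[stage1:${phase}] ${message}`);
    this.name = "Stage1Error";
  }
}
```

Callers would wrap ffmpeg failures as `"extraction"`, rejections and busy responses as `"request"`, and malformed payloads as `"mapping"`, then mark the subtitle job failed.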
Configuration
Add environment-driven config for the ASR API:
- app key
- access key
- resource id
- flash URL
- request timeout
- optional language hint
These should be separate from the existing Doubao LLM config because this is no longer the same provider call shape.
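A sketch of an env-driven config reader; the variable names are assumptions chosen to stay clearly separate from the existing Doubao LLM variables:

```typescript
// Sketch of ASR config loading; env var names are illustrative assumptions.
interface VolcAsrConfig {
  appKey: string;
  accessKey: string;
  resourceId: string;
  flashUrl: string;
  timeoutMs: number;
  languageHint?: string;
}

function loadVolcAsrConfig(env: Record<string, string | undefined>): VolcAsrConfig {
  const required = (name: string): string => {
    const value = env[name];
    if (!value) throw new Error(`Missing required env var ${name}`);
    return value;
  };
  return {
    appKey: required("VOLC_ASR_APP_KEY"),
    accessKey: required("VOLC_ASR_ACCESS_KEY"),
    resourceId: required("VOLC_ASR_RESOURCE_ID"),
    flashUrl: required("VOLC_ASR_FLASH_URL"),
    timeoutMs: Number(env["VOLC_ASR_TIMEOUT_MS"] ?? "30000"),
    languageHint: env["VOLC_ASR_LANGUAGE"] || undefined,
  };
}
```

Failing fast on missing required keys keeps misconfiguration visible at startup instead of surfacing as an opaque ASR rejection mid-job.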
Testing Strategy
Unit Tests
- audio extraction helper returns WAV base64
- ASR submit request body matches expected schema
- ASR response parsing handles success and API error codes
- ASR result mapping produces valid `TranscriptSegment[]`
Integration-Level Tests
- `transcriptionStage` uses the ASR helper when configured
- `multiStageSubtitleGeneration` still receives valid transcript output
- `subtitleService` keeps the frontend contract unchanged
Regression Focus
- no change to translation or voice matching contracts
- no regression in subtitle job stages and progress messages
- no regression in editor auto-generation flow
Risks
1. Confidence mismatch
If the ASR result does not provide a numeric confidence comparable to the current stage contract, we need a fallback policy. We should not invent precise confidence values.
2. Speaker metadata variance
Speaker labels and gender fields may differ from the current Stage 1 output. Downstream code already tolerates unknown, but we should normalize carefully.
Rollout Plan
- Implement the flash ASR client and mapping behind Stage 1.
- Keep old transcription path available behind a feature flag or fallback branch during transition.
- Validate with the known failing sample video.
- Remove the old direct multimodal transcription path once logs and results are stable.
Success Criteria
- Stage 1 no longer sends video content to a general model for transcription.
- Logs show flash ASR request lifecycle for each transcription request.
- The known failing sample produces original dialogue closer to the ground truth than the current Stage 1.
- Downstream translation and voice stages continue to work without UI contract changes.