video_translate/docs/plans/2026-03-20-volcengine-asr-transcription-design.md

# Volcengine ASR Stage-1 Replacement Design
**Date:** 2026-03-20
**Goal:** Replace the current Stage 1 `transcription` agent with Volcengine's flash ASR API using `audio.data` base64 input, so that original dialogue recognition is handled by a dedicated ASR service instead of a general multimodal model.
## Problem
The current Stage 1 pipeline uses a general model to transcribe dialogue from uploaded media. Even after changing the request to audio-only input, recognition quality is still limited by a model whose primary job is not ASR. When Stage 1 drifts from the real dialogue, the downstream translation and TTS stages faithfully amplify the mistake.
We need a Stage 1 that is optimized for:
- faithful speech recognition
- utterance-level timestamps
- speaker separation
- stable, repeatable audio-only behavior
## Recommendation
Adopt Volcengine's flash ASR API only for Stage 1 and keep the rest of the `4+1` pipeline unchanged:
1. `Transcription` -> Volcengine ASR
2. `Segmentation` -> existing local segmenter
3. `Translation` -> existing LLM translation stage
4. `Voice Matching` -> existing matcher
5. `Validation` -> existing validator
This gives us a better ASR foundation without rewriting the rest of the subtitle stack.
## Why This Fits
- The API is purpose-built for recorded audio recognition.
- It returns utterance-level timing data that maps well to our subtitle model.
- It supports speaker info and gender detection, which match our Stage 1 output shape.
- It accepts `audio.data` as base64, which avoids temporary public URL hosting.
- It returns results in a single synchronous request, so Stage 1 avoids the standard submit/query flow entirely.
## Scope
### In Scope
- Replace Stage 1 transcription provider for Doubao/Volcengine path with flash ASR API.
- Extract audio from uploaded or temp video before Stage 1.
- Send extracted audio as `audio.data` base64 in a single ASR request.
- Map ASR `utterances` into internal `TranscriptSegment[]`.
- Preserve existing downstream stages and existing editor UI contract.
- Add detailed Stage 1 logs covering the ASR request, response, and result mapping.
### Out of Scope
- Replacing translation, TTS, or voice matching providers.
- Changing the editor data contract beyond Stage 1 diagnostics.
- Reworking trim/export behavior outside Stage 1 input preparation.
- Automatic fallback to another ASR vendor in this first iteration.
## Current State
Relevant files:
- [transcriptionStage.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleStages/transcriptionStage.ts)
- [multiStageSubtitleGeneration.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/multiStageSubtitleGeneration.ts)
- [subtitleGeneration.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleGeneration.ts)
- [subtitleService.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/services/subtitleService.ts)
- [stageTypes.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleStages/stageTypes.ts)
Current Stage 1 responsibilities:
- extract audio locally when using Doubao
- call a model endpoint directly
- parse model JSON into `TranscriptSegment[]`
This means recognition logic and model prompt logic are still coupled together in one module.
## Target Architecture
### Stage 1 Flow
1. Receive video file on server.
2. Extract normalized WAV audio using `ffmpeg`.
3. Base64-encode the WAV audio.
4. Send the audio to `recognize/flash`.
5. Map `result.utterances` into `TranscriptSegment[]`.
6. Pass those segments to the existing `segmentation` stage.
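Steps 2-3 of this flow could be sketched as the helpers below. The normalization settings (16 kHz mono, 16-bit PCM) and the helper names are assumptions to verify against the flash ASR audio requirements, not decisions this design has made.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import { readFile } from "node:fs/promises";

const execFileAsync = promisify(execFile);

// Build the ffmpeg argument list for normalized WAV extraction.
// 16 kHz mono pcm_s16le is an assumption; check the ASR audio spec.
export function buildFfmpegArgs(inputPath: string, outputPath: string): string[] {
  return [
    "-y",            // overwrite the temp output if it exists
    "-i", inputPath, // source video or audio file
    "-vn",           // drop the video stream
    "-ac", "1",      // mono
    "-ar", "16000",  // 16 kHz sample rate
    "-acodec", "pcm_s16le",
    outputPath,
  ];
}

// Extract normalized WAV audio, then base64-encode it for audio.data.
export async function extractWavBase64(inputPath: string, wavPath: string): Promise<string> {
  await execFileAsync("ffmpeg", buildFfmpegArgs(inputPath, wavPath));
  const wav = await readFile(wavPath);
  return wav.toString("base64");
}
```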
### Proposed Modules
- `src/server/subtitleStages/transcriptionStage.ts`
- keep the stage entrypoint
- delegate provider-specific ASR work to a helper
- `src/server/volcengineAsr.ts`
- send flash recognition request
- parse and massage API payloads
## Input and Output Mapping
### ASR Input
Stage 1 should send:
- extracted WAV audio
- language hint when configured
- `show_utterances=true`
- `enable_speaker_info=true`
- `enable_gender_detection=true`
- conservative transcription options
We should avoid options that aggressively rewrite spoken language for readability in Stage 1.
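A hypothetical request-body builder might look like this. Only `show_utterances`, `enable_speaker_info`, `enable_gender_detection`, and `audio.data` come from this design; the surrounding nesting and any other field names are assumptions that must be checked against the Volcengine flash ASR reference.

```typescript
export interface FlashAsrRequestOptions {
  audioBase64: string;
  languageHint?: string; // only sent when configured
}

// Assemble the flash recognition request body. The audio/request nesting
// is an assumed shape, not confirmed against the vendor docs.
export function buildFlashAsrBody(opts: FlashAsrRequestOptions): Record<string, unknown> {
  return {
    audio: {
      data: opts.audioBase64, // base64 WAV, no temporary public URL needed
    },
    request: {
      show_utterances: true,
      enable_speaker_info: true,
      enable_gender_detection: true,
      // Conservative options only: no aggressive readability rewriting in Stage 1.
      ...(opts.languageHint ? { language: opts.languageHint } : {}),
    },
  };
}
```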
### ASR Output -> Internal Mapping
From ASR:
- `utterances[].text` -> `originalText`
- `utterances[].start_time` -> `startTime`
- `utterances[].end_time` -> `endTime`
- speaker info -> `speaker`, `speakerId`
- gender info -> `gender`
Internal notes:
- If numeric confidence is unavailable, set `confidence` to a safe default and rely on `needsReview` heuristics from other signals.
- If gender is missing, normalize to `unknown`.
- If utterances are missing, fall back to a single full-text segment only as a last resort.
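The mapping and the three fallback rules above can be sketched as a pure function. `TranscriptSegment` is simplified here (no `speakerId`) and should match `stageTypes.ts` in the real implementation; the `DEFAULT_CONFIDENCE` value is an assumed placeholder, not a calibrated score.

```typescript
interface AsrUtterance {
  text: string;
  start_time: number; // milliseconds, per the utterance-level timing data
  end_time: number;
  speaker?: string;
  gender?: string;
}

interface TranscriptSegment {
  originalText: string;
  startTime: number;
  endTime: number;
  speaker: string;
  gender: "male" | "female" | "unknown";
  confidence: number;
  needsReview: boolean;
}

const DEFAULT_CONFIDENCE = 0.8; // assumed safe default; do not invent precise scores

export function mapUtterancesToSegments(
  utterances: AsrUtterance[] | undefined,
  fullText: string,
  totalDurationMs: number,
): TranscriptSegment[] {
  // Last-resort fallback: one full-text segment when utterances are missing.
  if (!utterances || utterances.length === 0) {
    return [{
      originalText: fullText,
      startTime: 0,
      endTime: totalDurationMs,
      speaker: "unknown",
      gender: "unknown",
      confidence: DEFAULT_CONFIDENCE,
      needsReview: true, // no timing detail, so always flag for review
    }];
  }
  return utterances.map((u) => ({
    originalText: u.text,
    startTime: u.start_time,
    endTime: u.end_time,
    speaker: u.speaker ?? "unknown",
    // Missing or unexpected gender values normalize to "unknown".
    gender: u.gender === "male" || u.gender === "female" ? u.gender : "unknown",
    confidence: DEFAULT_CONFIDENCE,
    needsReview: false,
  }));
}
```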
## Temporary Audio URL Strategy
The flash ASR API supports `audio.data`, so Stage 1 can avoid temporary public URLs entirely.
Recommended implementation:
- extract WAV to a temp file
- read the file into base64
- send the base64 string in `audio.data`
- delete the temp file as soon as the base64 payload is in memory
This avoids network reachability issues and is the simplest option for local and server environments.
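A minimal sketch of that lifecycle, with cleanup in a `finally` block so the temp file is removed even when encoding fails. `extractWav` stands in for the ffmpeg step; the function and directory names are illustrative.

```typescript
import { mkdtemp, readFile, rm } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Extract WAV into a private temp dir, return its base64 payload, and
// delete the temp dir as soon as the payload is in memory.
export async function withWavBase64(
  extractWav: (wavPath: string) => Promise<void>,
): Promise<string> {
  const dir = await mkdtemp(join(tmpdir(), "asr-"));
  const wavPath = join(dir, "audio.wav");
  try {
    await extractWav(wavPath);
    const wav = await readFile(wavPath);
    return wav.toString("base64"); // payload for audio.data
  } finally {
    // Cleanup runs on both success and failure.
    await rm(dir, { recursive: true, force: true });
  }
}
```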
## Logging
Add explicit Stage 1 ASR logs:
- request started
- request finished
- result mapping summary
- cleanup status
Log fields should include:
- `requestId`
- `provider`
- `resourceId`
- API endpoint
- temp audio path presence
- base64 audio size
- request duration
- utterance count
We should continue logging summaries by default, not full raw transcripts, unless the user explicitly asks for full payload logging.
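One way to enforce the summary-by-default policy is a log shape that carries sizes and counts but never transcript text or raw paths. The field names below are hypothetical.

```typescript
export interface AsrLogSummary {
  requestId: string;
  provider: string;
  resourceId: string;
  endpoint: string;
  hasTempAudioPath: boolean; // presence only, never the path itself
  audioBase64Bytes: number;
  durationMs: number;
  utteranceCount: number;
}

// Reduce a completed ASR call to a loggable summary with no payload content.
export function summarizeAsrRequest(params: {
  requestId: string;
  provider: string;
  resourceId: string;
  endpoint: string;
  tempAudioPath?: string;
  audioBase64: string;
  startedAt: number;
  finishedAt: number;
  utteranceCount: number;
}): AsrLogSummary {
  return {
    requestId: params.requestId,
    provider: params.provider,
    resourceId: params.resourceId,
    endpoint: params.endpoint,
    hasTempAudioPath: Boolean(params.tempAudioPath),
    audioBase64Bytes: params.audioBase64.length,
    durationMs: params.finishedAt - params.startedAt,
    utteranceCount: params.utteranceCount,
  };
}
```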
## Error Handling
Stage 1 should fail clearly for:
- ffmpeg extraction failure
- ASR request rejection
- ASR service busy response
- malformed ASR result payload
Preferred behavior:
- fail Stage 1
- mark subtitle job as failed
- preserve enough context in logs to identify whether the failure happened in extraction, the ASR request, or result mapping
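A phase-tagged error type is one way to make that attribution cheap. The phase names below are illustrative and should match however Stage 1 ultimately names its steps.

```typescript
export type AsrFailurePhase = "extraction" | "request" | "mapping";

// Error carrying the Stage 1 phase so logs can tell extraction, request,
// and mapping failures apart without parsing the message text.
export class AsrStageError extends Error {
  constructor(
    public readonly phase: AsrFailurePhase,
    message: string,
  ) {
    super(`[asr:${phase}] ${message}`);
    this.name = "AsrStageError";
  }
}
```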
## Configuration
Add environment-driven config for the ASR API:
- app key
- access key
- resource id
- flash URL
- request timeout
- optional language hint
These should be separate from the existing Doubao LLM config because this is no longer the same provider call shape.
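A sketch of an env-driven loader that keeps this config separate from the Doubao LLM settings. Every variable name and the default timeout are assumptions for illustration, not agreed names.

```typescript
export interface VolcengineAsrConfig {
  appKey: string;
  accessKey: string;
  resourceId: string;
  flashUrl: string;
  timeoutMs: number;
  languageHint?: string;
}

// Load ASR config from the environment, failing fast on missing required keys.
// Accepting `env` as a parameter keeps the loader unit-testable.
export function loadVolcengineAsrConfig(
  env: Record<string, string | undefined> = process.env,
): VolcengineAsrConfig {
  const required = (name: string): string => {
    const value = env[name];
    if (!value) throw new Error(`Missing required env var: ${name}`);
    return value;
  };
  return {
    appKey: required("VOLC_ASR_APP_KEY"),
    accessKey: required("VOLC_ASR_ACCESS_KEY"),
    resourceId: required("VOLC_ASR_RESOURCE_ID"),
    flashUrl: required("VOLC_ASR_FLASH_URL"),
    timeoutMs: Number(env.VOLC_ASR_TIMEOUT_MS ?? "30000"), // assumed default
    languageHint: env.VOLC_ASR_LANGUAGE, // optional
  };
}
```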
## Testing Strategy
### Unit Tests
- audio extraction helper returns WAV base64
- ASR submit request body matches expected schema
- ASR response parsing handles success and API error codes
- ASR result mapping produces valid `TranscriptSegment[]`
### Integration-Level Tests
- `transcriptionStage` uses ASR helper when configured
- `multiStageSubtitleGeneration` still receives valid transcript output
- `subtitleService` keeps frontend contract unchanged
### Regression Focus
- no change to translation or voice matching contracts
- no regression in subtitle job stages and progress messages
- no regression in editor auto-generation flow
## Risks
### 1. Confidence mismatch
If the ASR result does not provide a numeric confidence comparable to the current stage contract, we need a fallback policy. We should not invent precise confidence values.
### 2. Speaker metadata variance
Speaker labels and gender fields may differ from the current Stage 1 output. Downstream code already tolerates `unknown`, but we should normalize carefully.
## Rollout Plan
1. Implement the flash ASR client and mapping behind Stage 1.
2. Keep old transcription path available behind a feature flag or fallback branch during transition.
3. Validate with the known failing sample video.
4. Remove the old direct multimodal transcription path once logs and results are stable.
## Success Criteria
- Stage 1 no longer sends video content to a general model for transcription.
- Logs show flash ASR request lifecycle for each transcription request.
- The known failing sample produces original dialogue closer to the ground truth than the current Stage 1.
- Downstream translation and voice stages continue to work without UI contract changes.