# Volcengine ASR Stage-1 Replacement Design

**Date:** 2026-03-20

**Goal**

Replace the current Stage 1 `transcription` agent with Volcengine's flash ASR API, using `audio.data` base64 input, so that original-dialogue recognition relies on a dedicated ASR service instead of a general multimodal model.
## Problem

The current Stage 1 pipeline uses a general model to transcribe dialogue from uploaded media. Even after changing the request to audio-only input, recognition quality is still limited by a model whose primary job is not ASR. When Stage 1 drifts from the real dialogue, the downstream translation and TTS stages faithfully amplify the mistake.

We need a Stage 1 that is optimized for:

- faithful speech recognition
- utterance-level timestamps
- speaker separation
- stable, repeatable audio-only behavior
## Recommendation

Adopt Volcengine's flash ASR API for Stage 1 only and keep the rest of the `4+1` pipeline unchanged:

1. `Transcription` -> Volcengine ASR
2. `Segmentation` -> existing local segmenter
3. `Translation` -> existing LLM translation stage
4. `Voice Matching` -> existing matcher
5. `Validation` -> existing validator

This gives us a better ASR foundation without rewriting the rest of the subtitle stack.
## Why This Fits

- The API is purpose-built for recorded audio recognition.
- It returns utterance-level timing data that maps well to our subtitle model.
- It supports speaker info and gender detection, which match our Stage 1 output shape.
- It accepts `audio.data` as base64, which avoids temporary public URL hosting.
- It returns results in a single request, so Stage 1 becomes simpler than the standard submit/query flow.
## Scope

### In Scope

- Replace the Stage 1 transcription provider for the Doubao/Volcengine path with the flash ASR API.
- Extract audio from the uploaded or temp video before Stage 1.
- Send the extracted audio as `audio.data` base64 in a single ASR request.
- Map ASR `utterances` into internal `TranscriptSegment[]`.
- Preserve the existing downstream stages and the existing editor UI contract.
- Add detailed Stage 1 logs for ASR submit/poll/result mapping.
### Out of Scope

- Replacing translation, TTS, or voice matching providers.
- Changing the editor data contract beyond Stage 1 diagnostics.
- Reworking trim/export behavior outside Stage 1 input preparation.
- Automatic fallback to another ASR vendor in this first iteration.
## Current State

Relevant files:

- [transcriptionStage.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleStages/transcriptionStage.ts)
- [multiStageSubtitleGeneration.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/multiStageSubtitleGeneration.ts)
- [subtitleGeneration.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleGeneration.ts)
- [subtitleService.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/services/subtitleService.ts)
- [stageTypes.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleStages/stageTypes.ts)
Current Stage 1 responsibilities:

- extract audio locally when using Doubao
- call a model endpoint directly
- parse model JSON into `TranscriptSegment[]`

This means recognition logic and model prompt logic are still coupled in one module.
## Target Architecture

### Stage 1 Flow

1. Receive the video file on the server.
2. Extract normalized WAV audio using `ffmpeg`.
3. Base64-encode the WAV audio.
4. Send the audio to `recognize/flash`.
5. Map `result.utterances` into `TranscriptSegment[]`.
6. Pass those segments to the existing `segmentation` stage.
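The extraction step above can be sketched as an argument builder for `ffmpeg`. The 16 kHz mono PCM settings are an assumption about what suits the ASR service, not a confirmed requirement, and `buildFfmpegArgs` is an illustrative helper name:

```typescript
// Sketch of step 2: build ffmpeg arguments for extracting normalized
// WAV audio from the uploaded video. Sample-rate/channel choices are
// assumptions to verify against the ASR service's audio requirements.
export function buildFfmpegArgs(videoPath: string, wavPath: string): string[] {
  return [
    "-y",                   // overwrite an existing temp file
    "-i", videoPath,        // input video
    "-vn",                  // drop the video stream
    "-acodec", "pcm_s16le", // 16-bit PCM
    "-ar", "16000",         // 16 kHz sample rate
    "-ac", "1",             // mono
    wavPath,
  ];
}

// Hypothetical usage: spawn("ffmpeg", buildFfmpegArgs(inputPath, tempWavPath))
```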
### Proposed Modules

- `src/server/subtitleStages/transcriptionStage.ts`
  - keep the stage entrypoint
  - delegate provider-specific ASR work to a helper
- `src/server/volcengineAsr.ts`
  - send the flash recognition request
  - parse and massage API payloads
## Input and Output Mapping

### ASR Input

Stage 1 should send:

- extracted WAV audio
- a language hint when configured
- `show_utterances=true`
- `enable_speaker_info=true`
- `enable_gender_detection=true`
- conservative transcription options

We should avoid options that aggressively rewrite spoken language for readability in Stage 1.
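A minimal request-body builder could look like the following. The flag names (`show_utterances`, `enable_speaker_info`, `enable_gender_detection`) come from this document; the exact nesting under `audio`/`request` and the `language` field name are assumptions to check against the Volcengine API reference:

```typescript
// Illustrative flash ASR request body. Nesting and the language field
// name are assumptions; only audio.data and the three boolean flags
// are taken from this design doc.
interface FlashAsrRequestOptions {
  languageHint?: string;
}

function buildFlashAsrRequest(wavBase64: string, opts: FlashAsrRequestOptions = {}) {
  return {
    audio: {
      data: wavBase64, // base64-encoded WAV; no public URL hosting needed
    },
    request: {
      ...(opts.languageHint ? { language: opts.languageHint } : {}),
      show_utterances: true,          // utterance-level timestamps
      enable_speaker_info: true,      // speaker separation
      enable_gender_detection: true,  // gender metadata for Stage 1 output
    },
  };
}
```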
### ASR Output -> Internal Mapping

From ASR:

- `utterances[].text` -> `originalText`
- `utterances[].start_time` -> `startTime`
- `utterances[].end_time` -> `endTime`
- speaker info -> `speaker`, `speakerId`
- gender info -> `gender`

Internal notes:

- If numeric confidence is unavailable, set `confidence` to a safe default and rely on `needsReview` heuristics from other signals.
- If gender is missing, normalize to `unknown`.
- If utterances are missing, fall back to a single full-text segment only as a last resort.
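The mapping and the normalization notes above can be sketched as one pure function. The `text`/`start_time`/`end_time` field names come from this document; the `speaker_id` field, the `TranscriptSegment` shape beyond the named fields, the 0.5 default confidence, and the `needsReview` heuristic are all illustrative assumptions:

```typescript
// Sketch of the utterance -> TranscriptSegment mapping. Field names on
// the ASR side beyond text/start_time/end_time are assumptions, as is
// the default confidence value.
interface AsrUtterance {
  text: string;
  start_time: number; // milliseconds
  end_time: number;   // milliseconds
  speaker_id?: string;
  gender?: string;
}

interface TranscriptSegment {
  originalText: string;
  startTime: number;
  endTime: number;
  speaker: string;
  speakerId: string;
  gender: "male" | "female" | "unknown";
  confidence: number;
  needsReview: boolean;
}

const DEFAULT_CONFIDENCE = 0.5; // assumed safe default; the doc leaves the exact value open

function mapUtterances(utterances: AsrUtterance[]): TranscriptSegment[] {
  return utterances.map((u) => ({
    originalText: u.text,
    startTime: u.start_time,
    endTime: u.end_time,
    speaker: u.speaker_id ? `Speaker ${u.speaker_id}` : "unknown",
    speakerId: u.speaker_id ?? "unknown",
    gender: u.gender === "male" || u.gender === "female" ? u.gender : "unknown",
    confidence: DEFAULT_CONFIDENCE,          // no invented precise confidence values
    needsReview: u.text.trim().length === 0, // stand-in heuristic from other signals
  }));
}
```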
## Temporary Audio URL Strategy

The flash ASR API supports `audio.data`, so Stage 1 can avoid temporary public URLs entirely.

Recommended implementation:

- extract WAV to a temp file
- read the file into base64
- send the base64 string in `audio.data`
- delete the temp file immediately after reading it

This avoids network reachability issues and is the simplest option for local and server environments.
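The read-encode-delete steps can be sketched in a few lines of Node; `readWavAsBase64` is an illustrative helper name:

```typescript
import * as fs from "node:fs";

// Minimal sketch of the recommended temp-file flow: read the extracted
// WAV into memory, base64-encode it for audio.data, and delete the temp
// file right after reading so nothing lingers on disk.
function readWavAsBase64(wavPath: string): string {
  const buf = fs.readFileSync(wavPath);
  fs.unlinkSync(wavPath); // nothing else needs the file once it is in memory
  return buf.toString("base64");
}
```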
## Logging

Add explicit Stage 1 ASR logs:

- request started
- request finished
- result mapping summary
- cleanup status

Log fields should include:

- `requestId`
- `provider`
- `resourceId`
- API endpoint
- temp audio path presence
- base64 audio size
- request duration
- utterance count

We should continue logging summaries by default, not full raw transcripts, unless the user explicitly asks for full payload logging.
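A summary builder makes the "presence and size only" policy concrete. The `AsrLogSummary` shape and `buildAsrLogSummary` name are illustrative, not existing project code:

```typescript
// Hypothetical Stage 1 ASR log summary: counts and sizes only, never
// full transcripts or raw payloads, matching the summary-by-default policy.
interface AsrLogSummary {
  requestId: string;
  provider: string;
  resourceId: string;
  endpoint: string;
  hasTempAudio: boolean;
  base64Bytes: number;
  durationMs: number;
  utteranceCount: number;
}

function buildAsrLogSummary(params: {
  requestId: string;
  provider: string;
  resourceId: string;
  endpoint: string;
  tempAudioPath?: string;
  audioBase64: string;
  durationMs: number;
  utteranceCount: number;
}): AsrLogSummary {
  return {
    requestId: params.requestId,
    provider: params.provider,
    resourceId: params.resourceId,
    endpoint: params.endpoint,
    hasTempAudio: Boolean(params.tempAudioPath), // presence only, not the path
    base64Bytes: params.audioBase64.length,      // size only, not the payload
    durationMs: params.durationMs,
    utteranceCount: params.utteranceCount,
  };
}
```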
## Error Handling

Stage 1 should fail clearly for:

- ffmpeg extraction failure
- ASR request rejection
- ASR service busy response
- malformed ASR result payload

Preferred behavior:

- fail Stage 1
- mark the subtitle job as failed
- preserve enough context in logs to identify whether the failure happened in extraction, submit, polling, or mapping
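One way to preserve that context is a phase-tagged error type. `AsrStageError` and `requireUtterances` are illustrative sketches, not existing project code:

```typescript
// Sketch of phase-tagged Stage 1 errors so logs can show whether a
// failure happened in extraction, submit, polling, or mapping.
type AsrPhase = "extraction" | "submit" | "polling" | "mapping";

class AsrStageError extends Error {
  constructor(public readonly phase: AsrPhase, message: string) {
    super(`[stage1:${phase}] ${message}`);
    this.name = "AsrStageError";
  }
}

// Example: a malformed ASR result payload fails clearly in the mapping phase.
function requireUtterances(payload: unknown): unknown[] {
  const utterances = (payload as { result?: { utterances?: unknown[] } })?.result?.utterances;
  if (!Array.isArray(utterances)) {
    throw new AsrStageError("mapping", "ASR result has no utterances array");
  }
  return utterances;
}
```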
## Configuration

Add environment-driven config for the ASR API:

- app key
- access key
- resource id
- flash URL
- request timeout
- optional language hint

These should be separate from the existing Doubao LLM config because this is no longer the same provider call shape.
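A loader for that config could look like the following. Every variable name (`VOLC_ASR_*`) is an assumption; the doc only specifies which settings must exist and that they stay separate from the Doubao LLM config:

```typescript
// Illustrative environment-driven config for the flash ASR client.
// All VOLC_ASR_* names are assumed, not confirmed project conventions.
interface VolcAsrConfig {
  appKey: string;
  accessKey: string;
  resourceId: string;
  flashUrl: string;
  timeoutMs: number;
  languageHint?: string;
}

function loadVolcAsrConfig(env: Record<string, string | undefined>): VolcAsrConfig {
  const required = (name: string): string => {
    const value = env[name];
    if (!value) throw new Error(`Missing required env var: ${name}`);
    return value;
  };
  return {
    appKey: required("VOLC_ASR_APP_KEY"),
    accessKey: required("VOLC_ASR_ACCESS_KEY"),
    resourceId: required("VOLC_ASR_RESOURCE_ID"),
    flashUrl: required("VOLC_ASR_FLASH_URL"),
    timeoutMs: Number(env["VOLC_ASR_TIMEOUT_MS"] ?? 60_000), // assumed default
    languageHint: env["VOLC_ASR_LANGUAGE"],                  // optional
  };
}
```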
## Testing Strategy

### Unit Tests

- audio extraction helper returns WAV base64
- ASR submit request body matches the expected schema
- ASR response parsing handles success and API error codes
- ASR result mapping produces valid `TranscriptSegment[]`

### Integration-Level Tests

- `transcriptionStage` uses the ASR helper when configured
- `multiStageSubtitleGeneration` still receives valid transcript output
- `subtitleService` keeps the frontend contract unchanged
### Regression Focus

- no change to translation or voice matching contracts
- no regression in subtitle job stages and progress messages
- no regression in the editor auto-generation flow
## Risks

### 1. Confidence mismatch

If the ASR result does not provide a numeric confidence comparable to the current stage contract, we need a fallback policy. We should not invent precise confidence values.

### 2. Speaker metadata variance

Speaker labels and gender fields may differ from the current Stage 1 output. Downstream code already tolerates `unknown`, but we should normalize carefully.
## Rollout Plan

1. Implement the flash ASR client and mapping behind Stage 1.
2. Keep the old transcription path available behind a feature flag or fallback branch during the transition.
3. Validate with the known failing sample video.
4. Remove the old direct multimodal transcription path once logs and results are stable.
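The transitional branch from step 2 can be sketched as below. Both `transcribe*` functions are stubs standing in for the new and legacy paths, and the flag plumbing is an assumption:

```typescript
// Sketch of the rollout feature flag. The stubs return tagged strings
// only so the branching is visible; real implementations live in Stage 1.
function transcribeWithFlashAsr(videoPath: string): string {
  return `flash-asr:${videoPath}`; // placeholder for the Volcengine ASR path
}

function transcribeWithLegacyModel(videoPath: string): string {
  return `multimodal:${videoPath}`; // placeholder for the old multimodal path
}

function runTranscription(videoPath: string, useFlashAsr: boolean): string {
  return useFlashAsr
    ? transcribeWithFlashAsr(videoPath)
    : transcribeWithLegacyModel(videoPath); // kept until logs and results are stable
}
```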
## Success Criteria

- Stage 1 no longer sends video content to a general model for transcription.
- Logs show the flash ASR request lifecycle for each transcription request.
- The known failing sample produces original dialogue closer to the ground truth than the current Stage 1 does.
- Downstream translation and voice stages continue to work without UI contract changes.