# Volcengine ASR Stage-1 Replacement Design

**Date:** 2026-03-20

**Goal**

Replace the current Stage 1 `transcription` agent with Volcengine's flash ASR API, using `audio.data` base64 input, so that original-dialogue recognition relies on a dedicated ASR service instead of a general multimodal model.
## Problem

The current Stage 1 pipeline uses a general model to transcribe dialogue from uploaded media. Even after changing the request to audio-only input, recognition quality is still limited by a model whose primary job is not ASR. When Stage 1 drifts from the real dialogue, the downstream translation and TTS stages faithfully amplify the mistake.

We need a Stage 1 that is optimized for:

- faithful speech recognition
- utterance-level timestamps
- speaker separation
- stable, repeatable audio-only behavior
## Recommendation

Adopt Volcengine's flash ASR API for Stage 1 only and keep the rest of the `4+1` pipeline unchanged:

1. `Transcription` -> Volcengine ASR
2. `Segmentation` -> existing local segmenter
3. `Translation` -> existing LLM translation stage
4. `Voice Matching` -> existing matcher
5. `Validation` -> existing validator

This gives us a better ASR foundation without rewriting the rest of the subtitle stack.
## Why This Fits

- The API is purpose-built for recorded audio recognition.
- It returns utterance-level timing data that maps well to our subtitle model.
- It supports speaker info and gender detection, which match our Stage 1 output shape.
- It accepts `audio.data` as base64, which avoids temporary public URL hosting.
- It returns results in a single request, so Stage 1 becomes simpler than the standard submit/query flow.
## Scope

### In Scope

- Replace the Stage 1 transcription provider for the Doubao/Volcengine path with the flash ASR API.
- Extract audio from the uploaded or temp video before Stage 1.
- Send the extracted audio as `audio.data` base64 in a single ASR request.
- Map ASR `utterances` into internal `TranscriptSegment[]`.
- Preserve the existing downstream stages and the existing editor UI contract.
- Add detailed Stage 1 logs for ASR submit/poll/result mapping.
### Out of Scope

- Replacing translation, TTS, or voice matching providers.
- Changing the editor data contract beyond Stage 1 diagnostics.
- Reworking trim/export behavior outside Stage 1 input preparation.
- Automatic fallback to another ASR vendor in this first iteration.
## Current State

Relevant files:

- [transcriptionStage.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleStages/transcriptionStage.ts)
- [multiStageSubtitleGeneration.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/multiStageSubtitleGeneration.ts)
- [subtitleGeneration.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleGeneration.ts)
- [subtitleService.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/services/subtitleService.ts)
- [stageTypes.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleStages/stageTypes.ts)
Current Stage 1 responsibilities:

- extract audio locally when using Doubao
- call a model endpoint directly
- parse model JSON into `TranscriptSegment[]`

This means recognition logic and model prompt logic are still coupled in one module.
## Target Architecture

### Stage 1 Flow

1. Receive the video file on the server.
2. Extract normalized WAV audio using `ffmpeg`.
3. Base64-encode the WAV audio.
4. Send the audio to `recognize/flash`.
5. Map `result.utterances` into `TranscriptSegment[]`.
6. Pass those segments to the existing `segmentation` stage.
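The extraction step above can be sketched as an argument builder for `ffmpeg`. The 16 kHz mono PCM settings are an assumption about what suits the ASR service, not a confirmed requirement, and `buildFfmpegArgs` is an illustrative helper name:

```typescript
// Sketch of step 2: build ffmpeg arguments for extracting normalized
// WAV audio from the uploaded video. Sample-rate/channel choices are
// assumptions to verify against the ASR service's audio requirements.
export function buildFfmpegArgs(videoPath: string, wavPath: string): string[] {
  return [
    "-y",                   // overwrite an existing temp file
    "-i", videoPath,        // input video
    "-vn",                  // drop the video stream
    "-acodec", "pcm_s16le", // 16-bit PCM
    "-ar", "16000",         // 16 kHz sample rate
    "-ac", "1",             // mono
    wavPath,
  ];
}

// Hypothetical usage: spawn("ffmpeg", buildFfmpegArgs(inputPath, tempWavPath))
```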
### Proposed Modules

- `src/server/subtitleStages/transcriptionStage.ts`
  - keep the stage entrypoint
  - delegate provider-specific ASR work to a helper
- `src/server/volcengineAsr.ts`
  - send the flash recognition request
  - parse and massage API payloads
## Input and Output Mapping

### ASR Input

Stage 1 should send:

- extracted WAV audio
- a language hint when configured
- `show_utterances=true`
- `enable_speaker_info=true`
- `enable_gender_detection=true`
- conservative transcription options

We should avoid options that aggressively rewrite spoken language for readability in Stage 1.
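A minimal request-body builder could look like the following. The flag names (`show_utterances`, `enable_speaker_info`, `enable_gender_detection`) come from this document; the exact nesting under `audio`/`request` and the `language` field name are assumptions to check against the Volcengine API reference:

```typescript
// Illustrative flash ASR request body. Nesting and the language field
// name are assumptions; only audio.data and the three boolean flags
// are taken from this design doc.
interface FlashAsrRequestOptions {
  languageHint?: string;
}

function buildFlashAsrRequest(wavBase64: string, opts: FlashAsrRequestOptions = {}) {
  return {
    audio: {
      data: wavBase64, // base64-encoded WAV; no public URL hosting needed
    },
    request: {
      ...(opts.languageHint ? { language: opts.languageHint } : {}),
      show_utterances: true,          // utterance-level timestamps
      enable_speaker_info: true,      // speaker separation
      enable_gender_detection: true,  // gender metadata for Stage 1 output
    },
  };
}
```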
### ASR Output -> Internal Mapping

From ASR:

- `utterances[].text` -> `originalText`
- `utterances[].start_time` -> `startTime`
- `utterances[].end_time` -> `endTime`
- speaker info -> `speaker`, `speakerId`
- gender info -> `gender`

Internal notes:

- If numeric confidence is unavailable, set `confidence` to a safe default and rely on `needsReview` heuristics from other signals.
- If gender is missing, normalize to `unknown`.
- If utterances are missing, fall back to a single full-text segment only as a last resort.
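The mapping and the normalization notes above can be sketched as one pure function. The `text`/`start_time`/`end_time` field names come from this document; the `speaker_id` field, the `TranscriptSegment` shape beyond the named fields, the 0.5 default confidence, and the `needsReview` heuristic are all illustrative assumptions:

```typescript
// Sketch of the utterance -> TranscriptSegment mapping. Field names on
// the ASR side beyond text/start_time/end_time are assumptions, as is
// the default confidence value.
interface AsrUtterance {
  text: string;
  start_time: number; // milliseconds
  end_time: number;   // milliseconds
  speaker_id?: string;
  gender?: string;
}

interface TranscriptSegment {
  originalText: string;
  startTime: number;
  endTime: number;
  speaker: string;
  speakerId: string;
  gender: "male" | "female" | "unknown";
  confidence: number;
  needsReview: boolean;
}

const DEFAULT_CONFIDENCE = 0.5; // assumed safe default; the doc leaves the exact value open

function mapUtterances(utterances: AsrUtterance[]): TranscriptSegment[] {
  return utterances.map((u) => ({
    originalText: u.text,
    startTime: u.start_time,
    endTime: u.end_time,
    speaker: u.speaker_id ? `Speaker ${u.speaker_id}` : "unknown",
    speakerId: u.speaker_id ?? "unknown",
    gender: u.gender === "male" || u.gender === "female" ? u.gender : "unknown",
    confidence: DEFAULT_CONFIDENCE,          // no invented precise confidence values
    needsReview: u.text.trim().length === 0, // stand-in heuristic from other signals
  }));
}
```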
## Temporary Audio URL Strategy

The flash ASR API supports `audio.data`, so Stage 1 can avoid temporary public URLs entirely.

Recommended implementation:

- extract WAV to a temp file
- read the file into base64
- send the base64 string in `audio.data`
- delete the temp file immediately after reading it

This avoids network reachability issues and is the simplest option for local and server environments.
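The read-encode-delete steps can be sketched in a few lines of Node; `readWavAsBase64` is an illustrative helper name:

```typescript
import * as fs from "node:fs";

// Minimal sketch of the recommended temp-file flow: read the extracted
// WAV into memory, base64-encode it for audio.data, and delete the temp
// file right after reading so nothing lingers on disk.
function readWavAsBase64(wavPath: string): string {
  const buf = fs.readFileSync(wavPath);
  fs.unlinkSync(wavPath); // nothing else needs the file once it is in memory
  return buf.toString("base64");
}
```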
## Logging

Add explicit Stage 1 ASR logs:

- request started
- request finished
- result mapping summary
- cleanup status

Log fields should include:

- `requestId`
- `provider`
- `resourceId`
- API endpoint
- temp audio path presence
- base64 audio size
- request duration
- utterance count

We should continue logging summaries by default, not full raw transcripts, unless the user explicitly asks for full payload logging.
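A summary builder makes the "presence and size only" policy concrete. The `AsrLogSummary` shape and `buildAsrLogSummary` name are illustrative, not existing project code:

```typescript
// Hypothetical Stage 1 ASR log summary: counts and sizes only, never
// full transcripts or raw payloads, matching the summary-by-default policy.
interface AsrLogSummary {
  requestId: string;
  provider: string;
  resourceId: string;
  endpoint: string;
  hasTempAudio: boolean;
  base64Bytes: number;
  durationMs: number;
  utteranceCount: number;
}

function buildAsrLogSummary(params: {
  requestId: string;
  provider: string;
  resourceId: string;
  endpoint: string;
  tempAudioPath?: string;
  audioBase64: string;
  durationMs: number;
  utteranceCount: number;
}): AsrLogSummary {
  return {
    requestId: params.requestId,
    provider: params.provider,
    resourceId: params.resourceId,
    endpoint: params.endpoint,
    hasTempAudio: Boolean(params.tempAudioPath), // presence only, not the path
    base64Bytes: params.audioBase64.length,      // size only, not the payload
    durationMs: params.durationMs,
    utteranceCount: params.utteranceCount,
  };
}
```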
## Error Handling

Stage 1 should fail clearly for:

- ffmpeg extraction failure
- ASR request rejection
- ASR service busy response
- malformed ASR result payload

Preferred behavior:

- fail Stage 1
- mark the subtitle job as failed
- preserve enough context in logs to identify whether the failure happened in extraction, submit, polling, or mapping
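One way to preserve that context is a phase-tagged error type. `AsrStageError` and `requireUtterances` are illustrative sketches, not existing project code:

```typescript
// Sketch of phase-tagged Stage 1 errors so logs can show whether a
// failure happened in extraction, submit, polling, or mapping.
type AsrPhase = "extraction" | "submit" | "polling" | "mapping";

class AsrStageError extends Error {
  constructor(public readonly phase: AsrPhase, message: string) {
    super(`[stage1:${phase}] ${message}`);
    this.name = "AsrStageError";
  }
}

// Example: a malformed ASR result payload fails clearly in the mapping phase.
function requireUtterances(payload: unknown): unknown[] {
  const utterances = (payload as { result?: { utterances?: unknown[] } })?.result?.utterances;
  if (!Array.isArray(utterances)) {
    throw new AsrStageError("mapping", "ASR result has no utterances array");
  }
  return utterances;
}
```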
## Configuration

Add environment-driven config for the ASR API:

- app key
- access key
- resource id
- flash URL
- request timeout
- optional language hint

These should be separate from the existing Doubao LLM config because this is no longer the same provider call shape.
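A loader for that config could look like the following. Every variable name (`VOLC_ASR_*`) is an assumption; the doc only specifies which settings must exist and that they stay separate from the Doubao LLM config:

```typescript
// Illustrative environment-driven config for the flash ASR client.
// All VOLC_ASR_* names are assumed, not confirmed project conventions.
interface VolcAsrConfig {
  appKey: string;
  accessKey: string;
  resourceId: string;
  flashUrl: string;
  timeoutMs: number;
  languageHint?: string;
}

function loadVolcAsrConfig(env: Record<string, string | undefined>): VolcAsrConfig {
  const required = (name: string): string => {
    const value = env[name];
    if (!value) throw new Error(`Missing required env var: ${name}`);
    return value;
  };
  return {
    appKey: required("VOLC_ASR_APP_KEY"),
    accessKey: required("VOLC_ASR_ACCESS_KEY"),
    resourceId: required("VOLC_ASR_RESOURCE_ID"),
    flashUrl: required("VOLC_ASR_FLASH_URL"),
    timeoutMs: Number(env["VOLC_ASR_TIMEOUT_MS"] ?? 60_000), // assumed default
    languageHint: env["VOLC_ASR_LANGUAGE"],                  // optional
  };
}
```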
## Testing Strategy

### Unit Tests

- audio extraction helper returns WAV base64
- ASR submit request body matches the expected schema
- ASR response parsing handles success and API error codes
- ASR result mapping produces valid `TranscriptSegment[]`

### Integration-Level Tests

- `transcriptionStage` uses the ASR helper when configured
- `multiStageSubtitleGeneration` still receives valid transcript output
- `subtitleService` keeps the frontend contract unchanged
### Regression Focus

- no change to translation or voice matching contracts
- no regression in subtitle job stages and progress messages
- no regression in the editor auto-generation flow
## Risks

### 1. Confidence mismatch

If the ASR result does not provide a numeric confidence comparable to the current stage contract, we need a fallback policy. We should not invent precise confidence values.

### 2. Speaker metadata variance

Speaker labels and gender fields may differ from the current Stage 1 output. Downstream code already tolerates `unknown`, but we should normalize carefully.
## Rollout Plan

1. Implement the flash ASR client and mapping behind Stage 1.
2. Keep the old transcription path available behind a feature flag or fallback branch during the transition.
3. Validate with the known failing sample video.
4. Remove the old direct multimodal transcription path once logs and results are stable.
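The transitional branch from step 2 can be sketched as below. Both `transcribe*` functions are stubs standing in for the new and legacy paths, and the flag plumbing is an assumption:

```typescript
// Sketch of the rollout feature flag. The stubs return tagged strings
// only so the branching is visible; real implementations live in Stage 1.
function transcribeWithFlashAsr(videoPath: string): string {
  return `flash-asr:${videoPath}`; // placeholder for the Volcengine ASR path
}

function transcribeWithLegacyModel(videoPath: string): string {
  return `multimodal:${videoPath}`; // placeholder for the old multimodal path
}

function runTranscription(videoPath: string, useFlashAsr: boolean): string {
  return useFlashAsr
    ? transcribeWithFlashAsr(videoPath)
    : transcribeWithLegacyModel(videoPath); // kept until logs and results are stable
}
```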
## Success Criteria

- Stage 1 no longer sends video content to a general model for transcription.
- Logs show the flash ASR request lifecycle for each transcription request.
- The known failing sample produces original dialogue closer to the ground truth than the current Stage 1 does.
- Downstream translation and voice stages continue to work without UI contract changes.