
Volcengine ASR Stage-1 Replacement Design

Date: 2026-03-20

Goal

Replace the current Stage 1 transcription agent with Volcengine's flash ASR API, sending extracted audio as base64 via audio.data, so that original dialogue recognition is handled by a dedicated ASR service instead of a general multimodal model.

Problem

The current Stage 1 pipeline uses a general model to transcribe dialogue from uploaded media. Even after changing the request to audio-only input, recognition quality is still limited by a model whose primary job is not ASR. When Stage 1 drifts from the real dialogue, the downstream translation and TTS stages faithfully amplify the mistake.

We need a Stage 1 that is optimized for:

  • faithful speech recognition
  • utterance-level timestamps
  • speaker separation
  • stable, repeatable audio-only behavior

Recommendation

Adopt Volcengine's flash ASR API only for Stage 1 and keep the rest of the 4+1 pipeline unchanged:

  1. Transcription -> Volcengine ASR
  2. Segmentation -> existing local segmenter
  3. Translation -> existing LLM translation stage
  4. Voice Matching -> existing matcher
  5. Validation -> existing validator

This gives us a better ASR foundation without rewriting the rest of the subtitle stack.
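The swap can be expressed as a provider table in which only the first row changes; the stage and provider labels below are illustrative, not real identifiers from the codebase:

```typescript
// Only the transcription row changes; the other four stages keep their
// existing implementations. Labels are illustrative, not real identifiers.
const pipelineProviders = [
  { stage: "transcription", provider: "volcengine-flash-asr" }, // replaced
  { stage: "segmentation", provider: "local-segmenter" },
  { stage: "translation", provider: "existing-llm" },
  { stage: "voice-matching", provider: "existing-matcher" },
  { stage: "validation", provider: "existing-validator" },
] as const;
```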

Why This Fits

  • The API is purpose-built for recorded audio recognition.
  • It returns utterance-level timing data that maps well to our subtitle model.
  • It supports speaker info and gender detection, which match our Stage 1 output shape.
  • It accepts audio.data as base64, which avoids temporary public URL hosting.
  • It returns results in a single request, so Stage 1 becomes simpler than the standard submit/query flow.

Scope

In Scope

  • Replace the Stage 1 transcription provider on the Doubao/Volcengine path with the flash ASR API.
  • Extract audio from uploaded or temp video before Stage 1.
  • Send extracted audio as audio.data base64 in a single ASR request.
  • Map ASR utterances into internal TranscriptSegment[].
  • Preserve existing downstream stages and existing editor UI contract.
  • Add detailed Stage 1 logs for ASR submit/poll/result mapping.

Out of Scope

  • Replacing translation, TTS, or voice matching providers.
  • Changing the editor data contract beyond Stage 1 diagnostics.
  • Reworking trim/export behavior outside Stage 1 input preparation.
  • Automatic fallback to another ASR vendor in this first iteration.

Current State

Relevant files:

Current Stage 1 responsibilities:

  • extract audio locally when using Doubao
  • call a model endpoint directly
  • parse model JSON into TranscriptSegment[]

This means recognition logic and model prompt logic are still coupled together in one module.

Target Architecture

Stage 1 Flow

  1. Receive video file on server.
  2. Extract normalized WAV audio using ffmpeg.
  3. Base64-encode the WAV audio.
  4. Send the audio to recognize/flash.
  5. Map result.utterances into TranscriptSegment[].
  6. Pass those segments to the existing segmentation stage.
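A minimal sketch of this flow, assuming hypothetical helper signatures `extractWavAudio` and `recognizeFlash` (neither exists in the codebase yet); the real Stage 1 would also thread config and logging through:

```typescript
import { promises as fs } from "fs";

// Minimal shape of a Stage 1 transcript segment (illustrative subset).
interface TranscriptSegment {
  originalText: string;
  startTime: number; // milliseconds
  endTime: number;   // milliseconds
}

// Orchestrates extract -> base64 -> recognize -> cleanup. The helpers are
// injected only to keep this sketch self-contained and testable.
async function runTranscriptionStage(
  videoPath: string,
  extractWavAudio: (videoPath: string) => Promise<string>,
  recognizeFlash: (audioBase64: string) => Promise<TranscriptSegment[]>,
): Promise<TranscriptSegment[]> {
  const wavPath = await extractWavAudio(videoPath);
  try {
    const audioBase64 = (await fs.readFile(wavPath)).toString("base64");
    return await recognizeFlash(audioBase64);
  } finally {
    await fs.unlink(wavPath).catch(() => {}); // best-effort temp cleanup
  }
}
```

The `try/finally` guarantees the temp WAV is removed even when the ASR request fails.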

Proposed Modules

  • src/server/subtitleStages/transcriptionStage.ts
    • keep the stage entrypoint
    • delegate provider-specific ASR work to a helper
  • src/server/volcengineAsr.ts
    • send flash recognition request
    • parse and massage API payloads

Input and Output Mapping

ASR Input

Stage 1 should send:

  • extracted WAV audio
  • language hint when configured
  • show_utterances=true
  • enable_speaker_info=true
  • enable_gender_detection=true
  • conservative transcription options

We should avoid options that aggressively rewrite spoken language for readability in Stage 1.
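As a sketch, the request body might be built like this; the exact field names and nesting are assumptions and must be verified against Volcengine's flash ASR documentation before implementation:

```typescript
// Hypothetical shape of the flash ASR request body; field names and nesting
// should be checked against the Volcengine API reference.
interface FlashAsrRequestBody {
  audio: { data: string; format: "wav" };
  request: {
    show_utterances: boolean;
    enable_speaker_info: boolean;
    enable_gender_detection: boolean;
    language?: string;
  };
}

function buildFlashAsrRequest(
  audioBase64: string,
  languageHint?: string,
): FlashAsrRequestBody {
  return {
    audio: { data: audioBase64, format: "wav" },
    request: {
      show_utterances: true,
      enable_speaker_info: true,
      enable_gender_detection: true,
      // Only send a language hint when one is configured.
      ...(languageHint ? { language: languageHint } : {}),
    },
  };
}
```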

ASR Output -> Internal Mapping

From ASR:

  • utterances[].text -> originalText
  • utterances[].start_time -> startTime
  • utterances[].end_time -> endTime
  • speaker info -> speaker, speakerId
  • gender info -> gender

Internal notes:

  • If numeric confidence is unavailable, set confidence to a safe default and rely on needsReview heuristics from other signals.
  • If gender is missing, normalize to unknown.
  • If utterances are missing, fall back to a single full-text segment only as a last resort.
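The mapping and the fallback notes above can be sketched as one pure function. The ASR-side field names follow the mapping table, but the `additions` wrapper for speaker/gender is an assumption about the payload shape:

```typescript
// ASR-side field names follow the mapping table above; the `additions`
// wrapper for speaker/gender is an assumption about the payload shape.
interface AsrUtterance {
  text: string;
  start_time: number; // milliseconds
  end_time: number;
  additions?: { speaker?: string; gender?: string };
}

interface TranscriptSegment {
  originalText: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  gender: "male" | "female" | "unknown";
  confidence: number;
  needsReview: boolean;
}

// Safe default when the ASR result carries no numeric confidence.
const DEFAULT_CONFIDENCE = 0.9;

function mapUtterances(utterances: AsrUtterance[]): TranscriptSegment[] {
  return utterances.map((u) => {
    const gender = u.additions?.gender;
    return {
      originalText: u.text,
      startTime: u.start_time,
      endTime: u.end_time,
      speakerId: u.additions?.speaker ?? "unknown",
      gender: gender === "male" || gender === "female" ? gender : "unknown",
      confidence: DEFAULT_CONFIDENCE,
      // Flag obviously suspect segments instead of inventing confidence.
      needsReview: u.end_time <= u.start_time || u.text.trim().length === 0,
    };
  });
}
```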

Temporary Audio URL Strategy

The flash ASR API supports audio.data, so Stage 1 can avoid temporary public URLs entirely.

Recommended implementation:

  • extract WAV to a temp file
  • read the file into base64
  • send the base64 string in audio.data
  • delete the temp file as soon as its contents have been read into base64

This avoids network reachability issues and is the simplest option for local and server environments.
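A sketch of the extraction step, assuming ffmpeg is on PATH; mono 16 kHz PCM is a common ASR-friendly format choice here, not a confirmed requirement of the flash API:

```typescript
import { spawn } from "child_process";
import { tmpdir } from "os";
import { join } from "path";

// Argument list for a normalized mono 16 kHz WAV; kept separate so it can
// be unit-tested without invoking ffmpeg.
function ffmpegWavArgs(videoPath: string, wavPath: string): string[] {
  return ["-y", "-i", videoPath, "-vn", "-ac", "1", "-ar", "16000", "-f", "wav", wavPath];
}

function extractWavAudio(videoPath: string): Promise<string> {
  const wavPath = join(tmpdir(), `stage1-${Date.now()}.wav`);
  return new Promise((resolve, reject) => {
    const proc = spawn("ffmpeg", ffmpegWavArgs(videoPath, wavPath));
    proc.on("error", reject); // e.g. ffmpeg missing from PATH
    proc.on("close", (code) =>
      code === 0 ? resolve(wavPath) : reject(new Error(`ffmpeg exited with code ${code}`)),
    );
  });
}
```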

Logging

Add explicit Stage 1 ASR logs:

  • request started
  • request finished
  • result mapping summary
  • cleanup status

Log fields should include:

  • requestId
  • provider
  • resourceId
  • API endpoint
  • temp audio path presence
  • base64 audio size
  • request duration
  • utterance count

We should continue logging summaries by default, not full raw transcripts, unless the user explicitly asks for full payload logging.
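The summary could be assembled as one structured object per request; the field names below mirror the list above and are proposals, not existing logger fields:

```typescript
// Illustrative Stage 1 ASR log summary; field names are proposals.
interface AsrLogSummary {
  requestId: string;
  provider: string;
  resourceId: string;
  endpoint: string;
  hasTempAudioPath: boolean;
  audioBase64Bytes: number;
  durationMs: number;
  utteranceCount: number;
}

function buildAsrLogSummary(params: {
  requestId: string;
  resourceId: string;
  endpoint: string;
  tempAudioPath?: string;
  audioBase64: string;
  startedAtMs: number;
  finishedAtMs: number;
  utteranceCount: number;
}): AsrLogSummary {
  return {
    requestId: params.requestId,
    provider: "volcengine",
    resourceId: params.resourceId,
    endpoint: params.endpoint,
    hasTempAudioPath: params.tempAudioPath !== undefined, // log presence, not the path
    audioBase64Bytes: params.audioBase64.length,          // size, not the payload itself
    durationMs: params.finishedAtMs - params.startedAtMs,
    utteranceCount: params.utteranceCount,
  };
}
```

Logging size and count rather than the raw payload keeps the summary-by-default policy intact.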

Error Handling

Stage 1 should fail clearly for:

  • ffmpeg extraction failure
  • ASR request rejection
  • ASR service busy response
  • malformed ASR result payload

Preferred behavior:

  • fail Stage 1
  • mark subtitle job as failed
  • preserve enough context in logs to identify whether failure happened in extraction, submit, polling, or mapping
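One way to preserve that context is a phase-tagged error; the class name and phase labels are suggestions:

```typescript
// Phases mirror the failure points listed above.
type Stage1Phase = "extraction" | "submit" | "polling" | "mapping";

class Stage1Error extends Error {
  constructor(
    public readonly phase: Stage1Phase,
    message: string,
  ) {
    super(`[stage1:${phase}] ${message}`);
    this.name = "Stage1Error";
  }
}
```

Wrapping each step's failures in `new Stage1Error("submit", ...)`-style throws lets the job-failure log state exactly which step broke.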

Configuration

Add environment-driven config for the ASR API:

  • app key
  • access key
  • resource id
  • flash URL
  • request timeout
  • optional language hint

These should be separate from the existing Doubao LLM config because this is no longer the same provider call shape.
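A sketch of the loader; the environment variable names are proposals and should follow however the project already namespaces its env vars:

```typescript
interface VolcengineAsrConfig {
  appKey: string;
  accessKey: string;
  resourceId: string;
  flashUrl: string;
  timeoutMs: number;
  languageHint?: string;
}

// Env var names are proposals; keep them separate from the Doubao LLM vars.
function loadAsrConfig(
  env: Record<string, string | undefined> = process.env,
): VolcengineAsrConfig {
  const required = (name: string): string => {
    const value = env[name];
    if (!value) throw new Error(`Missing required env var: ${name}`);
    return value;
  };
  return {
    appKey: required("VOLC_ASR_APP_KEY"),
    accessKey: required("VOLC_ASR_ACCESS_KEY"),
    resourceId: required("VOLC_ASR_RESOURCE_ID"),
    flashUrl: required("VOLC_ASR_FLASH_URL"),
    timeoutMs: Number(env["VOLC_ASR_TIMEOUT_MS"] ?? "30000"),
    languageHint: env["VOLC_ASR_LANGUAGE"], // optional hint
  };
}
```

Failing fast on missing keys surfaces misconfiguration at startup instead of as an opaque ASR rejection.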

Testing Strategy

Unit Tests

  • audio extraction helper returns WAV base64
  • ASR submit request body matches expected schema
  • ASR response parsing handles success and API error codes
  • ASR result mapping produces valid TranscriptSegment[]

Integration-Level Tests

  • transcriptionStage uses ASR helper when configured
  • multiStageSubtitleGeneration still receives valid transcript output
  • subtitleService keeps frontend contract unchanged

Regression Focus

  • no change to translation or voice matching contracts
  • no regression in subtitle job stages and progress messages
  • no regression in editor auto-generation flow

Risks

1. Confidence mismatch

If the ASR result does not provide a numeric confidence comparable to the current stage contract, we need a fallback policy. We should not invent precise confidence values.

2. Speaker metadata variance

Speaker labels and gender fields may differ from the current Stage 1 output. Downstream code already tolerates unknown, but we should normalize carefully.

Rollout Plan

  1. Implement the flash ASR client and mapping behind Stage 1.
  2. Keep old transcription path available behind a feature flag or fallback branch during transition.
  3. Validate with the known failing sample video.
  4. Remove the old direct multimodal transcription path once logs and results are stable.

Success Criteria

  • Stage 1 no longer sends video content to a general model for transcription.
  • Logs show flash ASR request lifecycle for each transcription request.
  • The known failing sample produces original dialogue closer to the ground truth than the current Stage 1.
  • Downstream translation and voice stages continue to work without UI contract changes.