
Volcengine ASR Stage-1 Replacement Design

Date: 2026-03-20

Goal

Replace the current Stage 1 transcription agent with Volcengine's flash ASR API, sending extracted audio as base64 via audio.data, so that original dialogue recognition is handled by a dedicated ASR service instead of a general multimodal model.

Problem

The current Stage 1 pipeline uses a general model to transcribe dialogue from uploaded media. Even after changing the request to audio-only input, recognition quality is still limited by a model whose primary job is not ASR. When Stage 1 drifts from the real dialogue, the downstream translation and TTS stages faithfully amplify the mistake.

We need a Stage 1 that is optimized for:

  • faithful speech recognition
  • utterance-level timestamps
  • speaker separation
  • stable, repeatable audio-only behavior

Recommendation

Adopt Volcengine's flash ASR API only for Stage 1 and keep the rest of the 4+1 pipeline unchanged:

  1. Transcription -> Volcengine ASR
  2. Segmentation -> existing local segmenter
  3. Translation -> existing LLM translation stage
  4. Voice Matching -> existing matcher
  5. Validation -> existing validator

This gives us a better ASR foundation without rewriting the rest of the subtitle stack.
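The swap can be expressed as a provider table in which only the first row changes; the stage and provider labels below are illustrative, not real identifiers from the codebase:

```typescript
// Only the transcription row changes; the other four stages keep their
// existing implementations. Labels are illustrative, not real identifiers.
const pipelineProviders = [
  { stage: "transcription", provider: "volcengine-flash-asr" }, // replaced
  { stage: "segmentation", provider: "local-segmenter" },
  { stage: "translation", provider: "existing-llm" },
  { stage: "voice-matching", provider: "existing-matcher" },
  { stage: "validation", provider: "existing-validator" },
] as const;
```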

Why This Fits

  • The API is purpose-built for recorded audio recognition.
  • It returns utterance-level timing data that maps well to our subtitle model.
  • It supports speaker info and gender detection, which match our Stage 1 output shape.
  • It accepts audio.data as base64, which avoids temporary public URL hosting.
  • It returns results in a single request, so Stage 1 becomes simpler than the standard submit/query flow.

Scope

In Scope

  • Replace the Stage 1 transcription provider on the Doubao/Volcengine path with the flash ASR API.
  • Extract audio from uploaded or temp video before Stage 1.
  • Send extracted audio as audio.data base64 in a single ASR request.
  • Map ASR utterances into internal TranscriptSegment[].
  • Preserve existing downstream stages and existing editor UI contract.
  • Add detailed Stage 1 logs for ASR submit/poll/result mapping.

Out of Scope

  • Replacing translation, TTS, or voice matching providers.
  • Changing the editor data contract beyond Stage 1 diagnostics.
  • Reworking trim/export behavior outside Stage 1 input preparation.
  • Automatic fallback to another ASR vendor in this first iteration.

Current State

Relevant files:

Current Stage 1 responsibilities:

  • extract audio locally when using Doubao
  • call a model endpoint directly
  • parse model JSON into TranscriptSegment[]

This means recognition logic and model prompt logic are still coupled together in one module.

Target Architecture

Stage 1 Flow

  1. Receive video file on server.
  2. Extract normalized WAV audio using ffmpeg.
  3. Base64-encode the WAV audio.
  4. Send the audio to recognize/flash.
  5. Map result.utterances into TranscriptSegment[].
  6. Pass those segments to the existing segmentation stage.
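A minimal sketch of this flow, assuming hypothetical helper signatures `extractWavAudio` and `recognizeFlash` (neither exists in the codebase yet); the real Stage 1 would also thread config and logging through:

```typescript
import { promises as fs } from "fs";

// Minimal shape of a Stage 1 transcript segment (illustrative subset).
interface TranscriptSegment {
  originalText: string;
  startTime: number; // milliseconds
  endTime: number;   // milliseconds
}

// Orchestrates extract -> base64 -> recognize -> cleanup. The helpers are
// injected only to keep this sketch self-contained and testable.
async function runTranscriptionStage(
  videoPath: string,
  extractWavAudio: (videoPath: string) => Promise<string>,
  recognizeFlash: (audioBase64: string) => Promise<TranscriptSegment[]>,
): Promise<TranscriptSegment[]> {
  const wavPath = await extractWavAudio(videoPath);
  try {
    const audioBase64 = (await fs.readFile(wavPath)).toString("base64");
    return await recognizeFlash(audioBase64);
  } finally {
    await fs.unlink(wavPath).catch(() => {}); // best-effort temp cleanup
  }
}
```

The `try/finally` guarantees the temp WAV is removed even when the ASR request fails.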

Proposed Modules

  • src/server/subtitleStages/transcriptionStage.ts
    • keep the stage entrypoint
    • delegate provider-specific ASR work to a helper
  • src/server/volcengineAsr.ts
    • send flash recognition request
    • parse and massage API payloads

Input and Output Mapping

ASR Input

Stage 1 should send:

  • extracted WAV audio
  • language hint when configured
  • show_utterances=true
  • enable_speaker_info=true
  • enable_gender_detection=true
  • conservative transcription options

We should avoid options that aggressively rewrite spoken language for readability in Stage 1.
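As a sketch, the request body might be built like this; the exact field names and nesting are assumptions and must be verified against Volcengine's flash ASR documentation before implementation:

```typescript
// Hypothetical shape of the flash ASR request body; field names and nesting
// should be checked against the Volcengine API reference.
interface FlashAsrRequestBody {
  audio: { data: string; format: "wav" };
  request: {
    show_utterances: boolean;
    enable_speaker_info: boolean;
    enable_gender_detection: boolean;
    language?: string;
  };
}

function buildFlashAsrRequest(
  audioBase64: string,
  languageHint?: string,
): FlashAsrRequestBody {
  return {
    audio: { data: audioBase64, format: "wav" },
    request: {
      show_utterances: true,
      enable_speaker_info: true,
      enable_gender_detection: true,
      // Only send a language hint when one is configured.
      ...(languageHint ? { language: languageHint } : {}),
    },
  };
}
```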

ASR Output -> Internal Mapping

From ASR:

  • utterances[].text -> originalText
  • utterances[].start_time -> startTime
  • utterances[].end_time -> endTime
  • speaker info -> speaker, speakerId
  • gender info -> gender

Internal notes:

  • If numeric confidence is unavailable, set confidence to a safe default and rely on needsReview heuristics from other signals.
  • If gender is missing, normalize to unknown.
  • If utterances are missing, fall back to a single full-text segment only as a last resort.
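The mapping and the fallback notes above can be sketched as one pure function. The ASR-side field names follow the mapping table, but the `additions` wrapper for speaker/gender is an assumption about the payload shape:

```typescript
// ASR-side field names follow the mapping table above; the `additions`
// wrapper for speaker/gender is an assumption about the payload shape.
interface AsrUtterance {
  text: string;
  start_time: number; // milliseconds
  end_time: number;
  additions?: { speaker?: string; gender?: string };
}

interface TranscriptSegment {
  originalText: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  gender: "male" | "female" | "unknown";
  confidence: number;
  needsReview: boolean;
}

// Safe default when the ASR result carries no numeric confidence.
const DEFAULT_CONFIDENCE = 0.9;

function mapUtterances(utterances: AsrUtterance[]): TranscriptSegment[] {
  return utterances.map((u) => {
    const gender = u.additions?.gender;
    return {
      originalText: u.text,
      startTime: u.start_time,
      endTime: u.end_time,
      speakerId: u.additions?.speaker ?? "unknown",
      gender: gender === "male" || gender === "female" ? gender : "unknown",
      confidence: DEFAULT_CONFIDENCE,
      // Flag obviously suspect segments instead of inventing confidence.
      needsReview: u.end_time <= u.start_time || u.text.trim().length === 0,
    };
  });
}
```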

Temporary Audio URL Strategy

The flash ASR API supports audio.data, so Stage 1 can avoid temporary public URLs entirely.

Recommended implementation:

  • extract WAV to a temp file
  • read the file into base64
  • send the base64 string in audio.data
  • delete the temp file as soon as its contents have been read into base64

This avoids network reachability issues and is the simplest option for local and server environments.
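A sketch of the extraction step, assuming ffmpeg is on PATH; mono 16 kHz PCM is a common ASR-friendly format choice here, not a confirmed requirement of the flash API:

```typescript
import { spawn } from "child_process";
import { tmpdir } from "os";
import { join } from "path";

// Argument list for a normalized mono 16 kHz WAV; kept separate so it can
// be unit-tested without invoking ffmpeg.
function ffmpegWavArgs(videoPath: string, wavPath: string): string[] {
  return ["-y", "-i", videoPath, "-vn", "-ac", "1", "-ar", "16000", "-f", "wav", wavPath];
}

function extractWavAudio(videoPath: string): Promise<string> {
  const wavPath = join(tmpdir(), `stage1-${Date.now()}.wav`);
  return new Promise((resolve, reject) => {
    const proc = spawn("ffmpeg", ffmpegWavArgs(videoPath, wavPath));
    proc.on("error", reject); // e.g. ffmpeg missing from PATH
    proc.on("close", (code) =>
      code === 0 ? resolve(wavPath) : reject(new Error(`ffmpeg exited with code ${code}`)),
    );
  });
}
```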

Logging

Add explicit Stage 1 ASR logs:

  • request started
  • request finished
  • result mapping summary
  • cleanup status

Log fields should include:

  • requestId
  • provider
  • resourceId
  • API endpoint
  • temp audio path presence
  • base64 audio size
  • request duration
  • utterance count

We should continue logging summaries by default, not full raw transcripts, unless the user explicitly asks for full payload logging.
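The summary could be assembled as one structured object per request; the field names below mirror the list above and are proposals, not existing logger fields:

```typescript
// Illustrative Stage 1 ASR log summary; field names are proposals.
interface AsrLogSummary {
  requestId: string;
  provider: string;
  resourceId: string;
  endpoint: string;
  hasTempAudioPath: boolean;
  audioBase64Bytes: number;
  durationMs: number;
  utteranceCount: number;
}

function buildAsrLogSummary(params: {
  requestId: string;
  resourceId: string;
  endpoint: string;
  tempAudioPath?: string;
  audioBase64: string;
  startedAtMs: number;
  finishedAtMs: number;
  utteranceCount: number;
}): AsrLogSummary {
  return {
    requestId: params.requestId,
    provider: "volcengine",
    resourceId: params.resourceId,
    endpoint: params.endpoint,
    hasTempAudioPath: params.tempAudioPath !== undefined, // log presence, not the path
    audioBase64Bytes: params.audioBase64.length,          // size, not the payload itself
    durationMs: params.finishedAtMs - params.startedAtMs,
    utteranceCount: params.utteranceCount,
  };
}
```

Logging size and count rather than the raw payload keeps the summary-by-default policy intact.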

Error Handling

Stage 1 should fail clearly for:

  • ffmpeg extraction failure
  • ASR request rejection
  • ASR service busy response
  • malformed ASR result payload

Preferred behavior:

  • fail Stage 1
  • mark subtitle job as failed
  • preserve enough context in logs to identify whether failure happened in extraction, submit, polling, or mapping
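One way to preserve that context is a phase-tagged error; the class name and phase labels are suggestions:

```typescript
// Phases mirror the failure points listed above.
type Stage1Phase = "extraction" | "submit" | "polling" | "mapping";

class Stage1Error extends Error {
  constructor(
    public readonly phase: Stage1Phase,
    message: string,
  ) {
    super(`[stage1:${phase}] ${message}`);
    this.name = "Stage1Error";
  }
}
```

Wrapping each step's failures in `new Stage1Error("submit", ...)`-style throws lets the job-failure log state exactly which step broke.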

Configuration

Add environment-driven config for the ASR API:

  • app key
  • access key
  • resource id
  • flash URL
  • request timeout
  • optional language hint

These should be separate from the existing Doubao LLM config because this is no longer the same provider call shape.
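A sketch of the loader; the environment variable names are proposals and should follow however the project already namespaces its env vars:

```typescript
interface VolcengineAsrConfig {
  appKey: string;
  accessKey: string;
  resourceId: string;
  flashUrl: string;
  timeoutMs: number;
  languageHint?: string;
}

// Env var names are proposals; keep them separate from the Doubao LLM vars.
function loadAsrConfig(
  env: Record<string, string | undefined> = process.env,
): VolcengineAsrConfig {
  const required = (name: string): string => {
    const value = env[name];
    if (!value) throw new Error(`Missing required env var: ${name}`);
    return value;
  };
  return {
    appKey: required("VOLC_ASR_APP_KEY"),
    accessKey: required("VOLC_ASR_ACCESS_KEY"),
    resourceId: required("VOLC_ASR_RESOURCE_ID"),
    flashUrl: required("VOLC_ASR_FLASH_URL"),
    timeoutMs: Number(env["VOLC_ASR_TIMEOUT_MS"] ?? "30000"),
    languageHint: env["VOLC_ASR_LANGUAGE"], // optional hint
  };
}
```

Failing fast on missing keys surfaces misconfiguration at startup instead of as an opaque ASR rejection.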

Testing Strategy

Unit Tests

  • audio extraction helper returns WAV base64
  • ASR submit request body matches expected schema
  • ASR response parsing handles success and API error codes
  • ASR result mapping produces valid TranscriptSegment[]

Integration-Level Tests

  • transcriptionStage uses ASR helper when configured
  • multiStageSubtitleGeneration still receives valid transcript output
  • subtitleService keeps frontend contract unchanged

Regression Focus

  • no change to translation or voice matching contracts
  • no regression in subtitle job stages and progress messages
  • no regression in editor auto-generation flow

Risks

1. Confidence mismatch

If the ASR result does not provide a numeric confidence comparable to the current stage contract, we need a fallback policy. We should not invent precise confidence values.

2. Speaker metadata variance

Speaker labels and gender fields may differ from the current Stage 1 output. Downstream code already tolerates unknown, but we should normalize carefully.

Rollout Plan

  1. Implement the flash ASR client and mapping behind Stage 1.
  2. Keep old transcription path available behind a feature flag or fallback branch during transition.
  3. Validate with the known failing sample video.
  4. Remove the old direct multimodal transcription path once logs and results are stable.

Success Criteria

  • Stage 1 no longer sends video content to a general model for transcription.
  • Logs show flash ASR request lifecycle for each transcription request.
  • The known failing sample produces original dialogue closer to the ground truth than the current Stage 1.
  • Downstream translation and voice stages continue to work without UI contract changes.