video_translate/docs/plans/2026-03-19-4-plus-1-subtitle-pipeline-design.md

4+1 Subtitle Pipeline Design

Goal

Replace the current one-shot subtitle generation flow with a staged 4+1 pipeline so transcription fidelity, translation quality, and voice selection can be improved independently.

Scope

  • Redesign the backend subtitle generation pipeline around five explicit stages.
  • Keep the current upload flow, async job flow, and editor entry points intact for the first implementation pass.
  • Preserve the current final payload shape for the editor, while adding richer intermediate metadata for debugging and review.
  • Keep Doubao and Gemini provider support, but stop asking one model call to do transcription, translation, and voice matching in a single response.

Non-Goals

  • Do not add a new human review UI in this step.
  • Do not replace the current editor or dubbing UI.
  • Do not require a new third-party ASR vendor for the first pass. Stage 1 can still use the current multimodal provider if its prompt and output contract are narrowed to transcription only.

Problem Summary

The current pipeline asks a single provider call to:

  • watch and listen to the video
  • transcribe dialogue
  • split subtitle segments
  • translate to English
  • translate to the TTS language
  • infer speaker metadata
  • select a voice id

This creates two major problems:

  1. When the model mishears the original dialogue, every downstream field is wrong.
  2. It is hard to tell whether a bad result came from transcription, segmentation, translation, or voice matching.

The new design addresses both problems by isolating each responsibility into its own stage.

Design

Stage overview

The new pipeline contains four production stages plus one validation stage:

  1. Stage 1: Transcription
  2. Stage 2: Segmentation
  3. Stage 3: Translation
  4. Stage 4: Voice Matching
  5. Stage 5: Validation

Each stage receives a narrow input contract and returns a narrow output contract. Later stages must never invent or overwrite core facts from earlier stages.

Stage 1: Transcription

Purpose

Extract the source dialogue from the video as faithfully as possible.

Input

  • local video path or remote fileId
  • provider configuration
  • request id

Output

interface TranscriptSegment {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
  speaker?: string;
  gender?: 'male' | 'female' | 'unknown';
  confidence?: number;
  needsReview?: boolean;
}

Rules

  • Only transcribe audible dialogue.
  • Do not translate.
  • Do not rewrite or polish.
  • Prefer conservative output when unclear.
  • Mark low-confidence segments with needsReview.

Notes

For the first pass, this stage can still call the current multimodal provider, but with a transcription-only prompt and schema. That gives the pipeline separation immediately without forcing a provider migration on day one.
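
The transcription-only contract above can be enforced at the parsing boundary. The sketch below is illustrative, not the project's actual parser: the function name and the confidence threshold are assumptions, but it shows how raw provider output could be validated into TranscriptSegment and how low-confidence segments get flagged per the rules.

```typescript
interface TranscriptSegment {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
  confidence?: number;
  needsReview?: boolean;
}

// Assumption: threshold would be tuned per provider.
const LOW_CONFIDENCE_THRESHOLD = 0.7;

function parseTranscriptSegment(raw: any, index: number): TranscriptSegment {
  // Reject output that violates the transcription-only schema.
  if (typeof raw.originalText !== "string" || raw.originalText.trim() === "") {
    throw new Error(`segment ${index}: missing originalText`);
  }
  if (
    typeof raw.startTime !== "number" ||
    typeof raw.endTime !== "number" ||
    raw.endTime <= raw.startTime
  ) {
    throw new Error(`segment ${index}: invalid timing`);
  }
  const confidence = typeof raw.confidence === "number" ? raw.confidence : undefined;
  return {
    id: raw.id ?? `seg-${index}`,
    startTime: raw.startTime,
    endTime: raw.endTime,
    originalText: raw.originalText.trim(),
    speakerId: raw.speakerId ?? "speaker-1",
    confidence,
    // Per the stage rules: mark low-confidence segments with needsReview.
    needsReview: confidence !== undefined && confidence < LOW_CONFIDENCE_THRESHOLD,
  };
}
```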

Stage 2: Segmentation

Purpose

Turn raw transcript segments into subtitle-friendly chunks without changing meaning.

Input

  • TranscriptSegment[]

Output

interface SegmentedSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
  speaker?: string;
  gender?: 'male' | 'female' | 'unknown';
  confidence?: number;
  needsReview?: boolean;
}

Rules

  • May split or merge segments for readability.
  • Must not paraphrase originalText.
  • Must preserve chronological order and non-overlap.
  • Should reuse existing normalization and sentence reconstruction helpers where possible.

Notes

This stage should absorb logic that is currently mixed between subtitlePipeline.ts and provider prompts.
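
One concrete segmentation policy consistent with the rules above is merging adjacent same-speaker fragments when the gap is short and the combined line stays readable. This is a sketch under assumed thresholds (a 42-character line limit, a 0.3 s gap), not the project's actual normalization helpers:

```typescript
interface SegmentedSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
}

const MAX_CHARS = 42;        // assumption: one-line subtitle limit
const MAX_GAP_SECONDS = 0.3; // assumption: pause that still reads as one line

function mergeShortSegments(segments: SegmentedSubtitle[]): SegmentedSubtitle[] {
  const out: SegmentedSubtitle[] = [];
  for (const seg of segments) {
    const prev = out[out.length - 1];
    const canMerge =
      prev !== undefined &&
      prev.speakerId === seg.speakerId &&
      seg.startTime - prev.endTime <= MAX_GAP_SECONDS &&
      (prev.originalText + " " + seg.originalText).length <= MAX_CHARS;
    if (canMerge) {
      // Concatenate verbatim: the rules forbid paraphrasing originalText.
      prev.originalText = prev.originalText + " " + seg.originalText;
      prev.endTime = seg.endTime;
    } else {
      out.push({ ...seg });
    }
  }
  return out;
}
```

Because merging only concatenates and extends endTime, chronological order and non-overlap are preserved by construction.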

Stage 3: Translation

Purpose

Translate already-confirmed source dialogue into display subtitles and dubbing text.

Input

  • SegmentedSubtitle[]
  • subtitle language settings
  • TTS language

Output

interface TranslatedSubtitle extends SegmentedSubtitle {
  translatedText: string;
  ttsText: string;
  ttsLanguage: string;
}

Rules

  • translatedText is always English and is used for on-screen subtitles.
  • ttsText is always in the requested TTS language.
  • Translation must derive from originalText only.
  • This stage must not edit timestamps or speaker identity.
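
A minimal sketch of this contract, with the translator injected as a callback so the stage itself cannot touch timing or speaker fields. In production the translation call would be async and provider-backed; it is kept synchronous here for brevity, and the function names are assumptions:

```typescript
interface SegmentedSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
}

interface TranslatedSubtitle extends SegmentedSubtitle {
  translatedText: string;
  ttsText: string;
  ttsLanguage: string;
}

type Translate = (text: string, targetLanguage: string) => string;

function translateStage(
  segments: SegmentedSubtitle[],
  ttsLanguage: string,
  translate: Translate,
): TranslatedSubtitle[] {
  return segments.map((seg) => ({
    ...seg, // timestamps and speakerId pass through unchanged
    // Rule: both outputs derive from originalText only.
    translatedText: translate(seg.originalText, "en"),
    ttsText: translate(seg.originalText, ttsLanguage),
    ttsLanguage,
  }));
}
```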

Stage 4: Voice Matching

Purpose

Assign the most suitable voiceId to each subtitle segment.

Input

  • TranslatedSubtitle[]
  • available voices for ttsLanguage

Output

interface VoiceMatchedSubtitle extends TranslatedSubtitle {
  voiceId: string;
}

Rules

  • Only select from the provided voice catalog.
  • Use speaker, gender, and tone hints when available.
  • Must not change transcript or translation fields.
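
A possible matcher under these rules: filter the catalog to the segment's ttsLanguage, prefer a gender match, and fall back to the first language-compatible voice. The Voice shape and the fallback policy are assumptions for this sketch:

```typescript
interface Voice {
  voiceId: string;
  language: string;
  gender: "male" | "female";
}

interface SubtitleForMatching {
  id: string;
  ttsLanguage: string;
  gender?: "male" | "female" | "unknown";
}

function matchVoice(sub: SubtitleForMatching, catalog: Voice[]): string {
  // Rule: only select from the provided voice catalog.
  const sameLanguage = catalog.filter((v) => v.language === sub.ttsLanguage);
  if (sameLanguage.length === 0) {
    // A hard failure per the error-handling section: no valid voiceId.
    throw new Error(`no voice available for language ${sub.ttsLanguage}`);
  }
  // Use the gender hint when available; otherwise fall back.
  const genderMatch = sameLanguage.find((v) => v.gender === sub.gender);
  return (genderMatch ?? sameLanguage[0]).voiceId;
}
```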

Stage 5: Validation

Purpose

Check internal consistency before returning the final result to the editor.

Input

  • VoiceMatchedSubtitle[]

Output

interface ValidationIssue {
  subtitleId: string;
  code:
    | 'low_confidence_transcript'
    | 'timing_overlap'
    | 'missing_tts_text'
    | 'voice_language_mismatch'
    | 'empty_translation';
  message: string;
  severity: 'warning' | 'error';
}

Rules

  • Do not rewrite content in this stage.
  • Report warnings and errors separately.
  • Only block final success on true contract failures, not soft quality warnings.
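
A validator covering a subset of the issue codes above (empty_translation, missing_tts_text, timing_overlap) might look like this; it reports issues without rewriting content, and it assumes input is already sorted by startTime:

```typescript
interface ValidationIssue {
  subtitleId: string;
  code: "timing_overlap" | "missing_tts_text" | "empty_translation";
  message: string;
  severity: "warning" | "error";
}

interface FinalSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  translatedText: string;
  ttsText: string;
}

// Assumption: subs arrive sorted by startTime (Stage 2 guarantees order).
function validateSubtitles(subs: FinalSubtitle[]): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  for (let i = 0; i < subs.length; i++) {
    const s = subs[i];
    if (s.translatedText.trim() === "") {
      issues.push({ subtitleId: s.id, code: "empty_translation",
        message: "translatedText is empty", severity: "error" });
    }
    if (s.ttsText.trim() === "") {
      issues.push({ subtitleId: s.id, code: "missing_tts_text",
        message: "ttsText is empty", severity: "error" });
    }
    if (i > 0 && s.startTime < subs[i - 1].endTime) {
      issues.push({ subtitleId: s.id, code: "timing_overlap",
        message: `overlaps previous subtitle ${subs[i - 1].id}`, severity: "warning" });
    }
  }
  return issues;
}
```

Splitting severities this way lets the orchestrator block only on errors while attaching warnings to diagnostics.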

Data Model

Final subtitle shape

The editor should continue to receive SubtitlePipelineResult, but the result will now be built from staged outputs rather than a single provider response.

Existing final subtitle fields remain:

  • id
  • startTime
  • endTime
  • originalText
  • translatedText
  • ttsText
  • ttsLanguage
  • speaker
  • speakerId
  • confidence
  • voiceId

New metadata

Add optional pipeline metadata to the final result:

interface SubtitlePipelineDiagnostics {
  validationIssues?: ValidationIssue[];
  stageDurationsMs?: Partial<Record<'transcription' | 'segmentation' | 'translation' | 'voiceMatching' | 'validation', number>>;
}

This metadata is primarily for logging, debugging, and future review UI. The editor does not need to block on it.

Runtime Architecture

New server modules

Add stage-focused modules under a new folder:

  • src/server/subtitleStages/transcriptionStage.ts
  • src/server/subtitleStages/segmentationStage.ts
  • src/server/subtitleStages/translationStage.ts
  • src/server/subtitleStages/voiceMatchingStage.ts
  • src/server/subtitleStages/validationStage.ts

Add one orchestrator:

  • src/server/multiStageSubtitleGeneration.ts
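
The orchestrator's core loop can be quite small: run the stages in order, pass each output to the next input, and record per-stage durations for SubtitlePipelineDiagnostics. This skeleton is a sketch — the stage bodies are injected, and the loose typing is a deliberate simplification:

```typescript
type StageName =
  | "transcription"
  | "segmentation"
  | "translation"
  | "voiceMatching"
  | "validation";

interface StageResult {
  output: unknown;
  stageDurationsMs: Partial<Record<StageName, number>>;
}

function runStages(
  input: unknown,
  stages: Array<[StageName, (value: any) => any]>,
): StageResult {
  const stageDurationsMs: Partial<Record<StageName, number>> = {};
  let value: any = input;
  for (const [name, stage] of stages) {
    const started = Date.now();
    // An exception here is a hard failure: it fails the async job.
    value = stage(value);
    stageDurationsMs[name] = Date.now() - started;
  }
  return { output: value, stageDurationsMs };
}
```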

Existing modules to adapt

Job Progress

The async job API should keep the same public contract but expose more precise stage labels:

  • transcribing
  • segmenting
  • translating
  • matching_voice
  • validating

This can be done either by extending SubtitleJobStage or by mapping internal stage names onto the existing progress system.

Error Handling

Hard failures

  • unreadable video input
  • provider request failure
  • invalid stage output schema
  • no subtitles returned after transcription
  • no valid voiceId in Stage 4

These should fail the async job.

Soft warnings

  • low transcript confidence
  • suspected mistranscription
  • missing speaker gender
  • fallback default voice applied

These should be attached to diagnostics and surfaced later, but should not block the result.
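
One way to encode this split is a type-level divide between failure and warning codes, so the orchestrator can decide whether to fail the job or attach to diagnostics. The code names below echo the two lists above but are assumptions, not the project's actual identifiers:

```typescript
type FailureCode =
  | "unreadable_input"
  | "provider_error"
  | "invalid_stage_output"
  | "empty_transcript"
  | "no_voice_match";

type WarningCode =
  | "low_confidence"
  | "suspected_mistranscription"
  | "missing_gender"
  | "fallback_voice";

// Hard failures fail the async job; warnings go into diagnostics.
function isHardFailure(code: FailureCode | WarningCode): code is FailureCode {
  const hard: ReadonlySet<string> = new Set([
    "unreadable_input",
    "provider_error",
    "invalid_stage_output",
    "empty_transcript",
    "no_voice_match",
  ]);
  return hard.has(code);
}
```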

Testing

Unit tests

  • Stage 1 prompt and parsing tests
  • Stage 2 segmentation behavior tests
  • Stage 3 translation contract tests
  • Stage 4 voice catalog matching tests
  • Stage 5 validation issue generation tests

Integration tests

  • full orchestration success path
  • transcription low-confidence warning path
  • translation failure path
  • voice mismatch validation path
  • async job progress updates across all five stages

Rollout Strategy

Phase 1

  • Add stage contracts and orchestrator behind the existing /generate-subtitles entry point.
  • Keep the final API shape stable.

Phase 2

  • Store validation issues in the result payload.
  • Surface review warnings in the editor.

Phase 3

  • Optionally swap Stage 1 to a stronger dedicated ASR service without changing translation, voice matching, or editor code.

Recommendation

Implement the 4+1 pipeline in the backend first, while keeping the current frontend contract stable. That gives immediate gains in debuggability and transcription discipline, and it creates a clean seam for future ASR upgrades.