4+1 Subtitle Pipeline Design
Goal
Replace the current one-shot subtitle generation flow with a staged 4+1 pipeline so transcription fidelity, translation quality, and voice selection can be improved independently.
Scope
- Redesign the backend subtitle generation pipeline around five explicit stages.
- Keep the current upload flow, async job flow, and editor entry points intact for the first implementation pass.
- Preserve the current final payload shape for the editor, while adding richer intermediate metadata for debugging and review.
- Keep Doubao and Gemini provider support, but stop asking one model call to do transcription, translation, and voice matching in a single response.
Non-Goals
- Do not add a new human review UI in this step.
- Do not replace the current editor or dubbing UI.
- Do not require a new third-party ASR vendor for the first pass. Stage 1 can still use the current multimodal provider if its prompt and output contract are narrowed to transcription only.
Problem Summary
The current pipeline asks a single provider call to:
- watch and listen to the video
- transcribe dialogue
- split subtitle segments
- translate to English
- translate to the TTS language
- infer speaker metadata
- select a voice id
This creates two major problems:
- When the model mishears the original dialogue, every downstream field is wrong.
- It is hard to tell whether a bad result came from transcription, segmentation, translation, or voice matching.
The new design fixes that by isolating each responsibility.
Design
Stage overview
The new pipeline contains four production stages plus one validation stage:
- Stage 1: Transcription
- Stage 2: Segmentation
- Stage 3: Translation
- Stage 4: Voice Matching
- Stage 5: Validation
Each stage receives a narrow input contract and returns a narrow output contract. Later stages must never invent or overwrite core facts from earlier stages.
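The "never overwrite core facts" rule can be enforced mechanically between stages. The following is a minimal sketch, assuming the guard runs after Stages 3 to 5 only (Stage 2 is allowed to split and merge segments, so ids and counts may legitimately change there). `CoreFacts` and `assertCoreFactsPreserved` are hypothetical names, not existing code.

```typescript
// Core facts that Stages 3 to 5 are expected to leave untouched.
// Field names come from the stage interfaces in this document.
interface CoreFacts {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
}

const CORE_KEYS = ['id', 'startTime', 'endTime', 'originalText', 'speakerId'] as const;

// Hypothetical guard: throws if a stage dropped, added, reordered,
// or modified any core fact of its input segments.
function assertCoreFactsPreserved(
  stageName: string,
  before: readonly CoreFacts[],
  after: readonly CoreFacts[],
): void {
  if (before.length !== after.length) {
    throw new Error(`${stageName}: segment count changed from ${before.length} to ${after.length}`);
  }
  before.forEach((prev, index) => {
    const next = after[index];
    if (!next) {
      throw new Error(`${stageName}: segment ${prev.id} is missing`);
    }
    for (const key of CORE_KEYS) {
      if (prev[key] !== next[key]) {
        throw new Error(`${stageName}: segment ${prev.id} changed "${key}"`);
      }
    }
  });
}
```

The orchestrator could call this after each later stage and treat a throw as an invalid stage output.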
Stage 1: Transcription
Purpose
Extract the source dialogue from the video as faithfully as possible.
Input
- local video path or remote fileId
- provider configuration
- request id
Output
interface TranscriptSegment {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
  speaker?: string;
  gender?: 'male' | 'female' | 'unknown';
  confidence?: number;
  needsReview?: boolean;
}
Rules
- Only transcribe audible dialogue.
- Do not translate.
- Do not rewrite or polish.
- Prefer conservative output when unclear.
- Mark low-confidence segments with needsReview.
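The needsReview rule could be applied as a post-processing step on the provider output. This is a sketch only: the 0.6 cutoff and the name `flagLowConfidence` are assumptions, since the design does not fix a threshold.

```typescript
interface TranscriptSegment {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
  speaker?: string;
  gender?: 'male' | 'female' | 'unknown';
  confidence?: number;
  needsReview?: boolean;
}

// Assumed cutoff; the design does not specify one.
const REVIEW_CONFIDENCE_THRESHOLD = 0.6;

// Hypothetical helper: returns copies of the segments with needsReview set.
function flagLowConfidence(
  segments: readonly TranscriptSegment[],
  threshold: number = REVIEW_CONFIDENCE_THRESHOLD,
): TranscriptSegment[] {
  return segments.map((segment) => {
    // A missing score is treated as "unknown" rather than "low" (an assumption).
    const isLow = segment.confidence !== undefined && segment.confidence < threshold;
    // Keep a flag the provider already set; never clear it here.
    return { ...segment, needsReview: segment.needsReview === true || isLow };
  });
}
```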
Notes
For the first pass, this stage can still call the current multimodal provider, but with a transcription-only prompt and schema. That gives the pipeline separation immediately without forcing a provider migration on day one.
Stage 2: Segmentation
Purpose
Turn raw transcript segments into subtitle-friendly chunks without changing meaning.
Input
TranscriptSegment[]
Output
interface SegmentedSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
  speaker?: string;
  gender?: 'male' | 'female' | 'unknown';
  confidence?: number;
  needsReview?: boolean;
}
Rules
- May split or merge segments for readability.
- Must not paraphrase originalText.
- Must preserve chronological order and non-overlap.
- Should reuse existing normalization and sentence reconstruction helpers where possible.
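As an illustration of merging without paraphrasing, the sketch below joins adjacent segments from the same speaker verbatim. It assumes timestamps are in seconds; the gap limit, the character limit, and the name `mergeShortSegments` are illustrative and not taken from the existing helpers. The default joiner is a space, which suits space-delimited languages; a Chinese source would likely pass an empty string.

```typescript
// Trimmed copy of SegmentedSubtitle with only the fields this sketch touches.
interface SegmentedSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
  needsReview?: boolean;
}

// Assumed limits; tune them against real subtitles.
const MAX_GAP_SECONDS = 0.3;
const MAX_CHARS = 42;

// Hypothetical helper: merges neighbouring segments when they share a speaker,
// are close in time, and fit within one subtitle line.
function mergeShortSegments(
  segments: readonly SegmentedSubtitle[],
  joiner: string = ' ',
): SegmentedSubtitle[] {
  // Sort a copy so chronological order holds even if the input is shuffled.
  const sorted = [...segments].sort((a, b) => a.startTime - b.startTime);
  const merged: SegmentedSubtitle[] = [];
  for (const current of sorted) {
    const previous = merged[merged.length - 1];
    if (
      previous &&
      previous.speakerId === current.speakerId &&
      current.startTime - previous.endTime <= MAX_GAP_SECONDS &&
      previous.originalText.length + joiner.length + current.originalText.length <= MAX_CHARS
    ) {
      merged[merged.length - 1] = {
        ...previous,
        endTime: Math.max(previous.endTime, current.endTime),
        // Text is concatenated verbatim; nothing is reworded.
        originalText: `${previous.originalText}${joiner}${current.originalText}`,
        needsReview: previous.needsReview === true || current.needsReview === true,
      };
    } else {
      merged.push({ ...current });
    }
  }
  return merged;
}
```

The merged segment keeps the id of the first segment it absorbed. Whether ids should be regenerated after Stage 2 is left open here.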
Notes
This stage should absorb logic that is currently mixed between subtitlePipeline.ts and provider prompts.
Stage 3: Translation
Purpose
Translate already-confirmed source dialogue into display subtitles and dubbing text.
Input
- SegmentedSubtitle[]
- subtitle language settings
- TTS language
Output
interface TranslatedSubtitle extends SegmentedSubtitle {
  translatedText: string;
  ttsText: string;
  ttsLanguage: string;
}
Rules
- translatedText is always English for on-screen subtitles.
- ttsText is always the requested TTS language.
- Translation must derive from originalText only.
- This stage must not edit timestamps or speaker identity.
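One way to guarantee that Stage 3 cannot edit timestamps or speaker identity is to ask the provider for translation fields only and merge them back by id. The sketch below assumes such a narrowed response shape; `TranslationPatch` and `applyTranslations` are hypothetical names.

```typescript
// Trimmed copies of the stage types defined in this document.
interface SegmentedSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
}

interface TranslatedSubtitle extends SegmentedSubtitle {
  translatedText: string;
  ttsText: string;
  ttsLanguage: string;
}

// Hypothetical provider response: translation fields only, keyed by segment id.
interface TranslationPatch {
  id: string;
  translatedText: string;
  ttsText: string;
}

function applyTranslations(
  segments: readonly SegmentedSubtitle[],
  patches: readonly TranslationPatch[],
  ttsLanguage: string,
): TranslatedSubtitle[] {
  const patchById = new Map<string, TranslationPatch>();
  for (const patch of patches) {
    patchById.set(patch.id, patch);
  }
  // Iterate over the Stage 2 input so order and count follow the source, not the provider.
  return segments.map((segment) => {
    const patch = patchById.get(segment.id);
    if (!patch) {
      throw new Error(`translation missing for segment ${segment.id}`);
    }
    // Timestamps, speaker fields, and originalText are copied from the input unchanged.
    return {
      ...segment,
      translatedText: patch.translatedText,
      ttsText: patch.ttsText,
      ttsLanguage,
    };
  });
}
```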
Stage 4: Voice Matching
Purpose
Assign the most suitable voiceId to each subtitle segment.
Input
- TranslatedSubtitle[]
- available voices for ttsLanguage
Output
interface VoiceMatchedSubtitle extends TranslatedSubtitle {
  voiceId: string;
}
Rules
- Only select from the provided voice catalog.
- Use speaker, gender, and tone hints when available.
- Must not change transcript or translation fields.
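A deterministic fallback matcher shows how the catalog-only rule might look in code. The catalog entry shape (`VoiceOption`) is an assumption because the real voice catalog type is not shown in this design. Reusing one voice per speakerId is also an assumption, and tone hints are left out.

```typescript
type Gender = 'male' | 'female' | 'unknown';

// Hypothetical catalog entry; the real catalog shape is not given here.
interface VoiceOption {
  voiceId: string;
  language: string;
  gender: Gender;
}

// Only the fields the matcher reads from a TranslatedSubtitle.
interface VoiceCandidate {
  speakerId: string;
  gender?: Gender;
  ttsLanguage: string;
}

function pickVoiceId(subtitle: VoiceCandidate, catalog: readonly VoiceOption[]): string {
  const sameLanguage = catalog.filter((voice) => voice.language === subtitle.ttsLanguage);
  const fallback = sameLanguage[0];
  if (!fallback) {
    // No voice in the catalog for this language: a hard failure per this design.
    throw new Error(`no voice available for language ${subtitle.ttsLanguage}`);
  }
  const wanted = subtitle.gender;
  const genderMatch =
    wanted !== undefined && wanted !== 'unknown'
      ? sameLanguage.find((voice) => voice.gender === wanted)
      : undefined;
  // Fall back to the first voice in the language when gender gives no match.
  return (genderMatch ?? fallback).voiceId;
}

// Assumption: a speaker keeps the first voice chosen for them across the video.
function matchVoices(
  subtitles: readonly VoiceCandidate[],
  catalog: readonly VoiceOption[],
): Array<VoiceCandidate & { voiceId: string }> {
  const voiceBySpeaker = new Map<string, string>();
  return subtitles.map((subtitle) => {
    const voiceId = voiceBySpeaker.get(subtitle.speakerId) ?? pickVoiceId(subtitle, catalog);
    voiceBySpeaker.set(subtitle.speakerId, voiceId);
    return { ...subtitle, voiceId };
  });
}
```

Because the result is always taken from the catalog argument, an id that does not exist in the catalog cannot be returned.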
Stage 5: Validation
Purpose
Check internal consistency before returning the final result to the editor.
Input
VoiceMatchedSubtitle[]
Output
interface ValidationIssue {
  subtitleId: string;
  code:
    | 'low_confidence_transcript'
    | 'timing_overlap'
    | 'missing_tts_text'
    | 'voice_language_mismatch'
    | 'empty_translation';
  message: string;
  severity: 'warning' | 'error';
}
Rules
- Do not rewrite content in this stage.
- Report warnings and errors separately.
- Only block final success on true contract failures, not soft quality warnings.
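A read-only validator along these lines would satisfy the rules above. The severity assigned to each code and the 0.6 confidence cutoff are assumptions, since the design does not fix them. The voice_language_mismatch check is omitted because it needs the voice catalog.

```typescript
interface ValidationIssue {
  subtitleId: string;
  code:
    | 'low_confidence_transcript'
    | 'timing_overlap'
    | 'missing_tts_text'
    | 'voice_language_mismatch'
    | 'empty_translation';
  message: string;
  severity: 'warning' | 'error';
}

// Only the fields this sketch inspects from a VoiceMatchedSubtitle.
interface ValidatableSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  translatedText: string;
  ttsText: string;
  confidence?: number;
}

// Assumed cutoff for the low-confidence warning.
const LOW_CONFIDENCE_CUTOFF = 0.6;

// Hypothetical validator: reports issues and never mutates its input.
function validateSubtitles(subtitles: readonly ValidatableSubtitle[]): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  let previous: ValidatableSubtitle | undefined;
  for (const subtitle of subtitles) {
    if (previous && subtitle.startTime < previous.endTime) {
      issues.push({
        subtitleId: subtitle.id,
        code: 'timing_overlap',
        message: `starts before subtitle ${previous.id} ends`,
        severity: 'error',
      });
    }
    if (subtitle.translatedText.trim() === '') {
      issues.push({ subtitleId: subtitle.id, code: 'empty_translation', message: 'translatedText is empty', severity: 'error' });
    }
    if (subtitle.ttsText.trim() === '') {
      issues.push({ subtitleId: subtitle.id, code: 'missing_tts_text', message: 'ttsText is empty', severity: 'error' });
    }
    if (subtitle.confidence !== undefined && subtitle.confidence < LOW_CONFIDENCE_CUTOFF) {
      issues.push({
        subtitleId: subtitle.id,
        code: 'low_confidence_transcript',
        message: `confidence ${subtitle.confidence} is below ${LOW_CONFIDENCE_CUTOFF}`,
        severity: 'warning',
      });
    }
    previous = subtitle;
  }
  return issues;
}
```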
Data Model
Final subtitle shape
The editor should continue to receive SubtitlePipelineResult, but the result will now be built from staged outputs rather than a single provider response.
Existing final subtitle fields remain:
- id
- startTime
- endTime
- originalText
- translatedText
- ttsText
- ttsLanguage
- speaker
- speakerId
- confidence
- voiceId
New metadata
Add optional pipeline metadata to the final result:
interface SubtitlePipelineDiagnostics {
  validationIssues?: ValidationIssue[];
  stageDurationsMs?: Partial<Record<'transcription' | 'segmentation' | 'translation' | 'voiceMatching' | 'validation', number>>;
}
This metadata is primarily for logging, debugging, and future review UI. The editor does not need to block on it.
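Stage durations could be captured with a small timer helper like the one sketched here. `startStageTimer` is a hypothetical name. The clock is injectable only to make the helper easy to test, and the diagnostics type is trimmed to the field being written.

```typescript
type StageName = 'transcription' | 'segmentation' | 'translation' | 'voiceMatching' | 'validation';

// Trimmed copy of SubtitlePipelineDiagnostics (validationIssues omitted).
interface SubtitlePipelineDiagnostics {
  stageDurationsMs?: Partial<Record<StageName, number>>;
}

// Starts timing a stage and returns a function that records the elapsed
// milliseconds into diagnostics.stageDurationsMs when called.
function startStageTimer(
  diagnostics: SubtitlePipelineDiagnostics,
  stage: StageName,
  now: () => number = Date.now,
): () => void {
  const startedAt = now();
  return () => {
    const durations = diagnostics.stageDurationsMs ?? {};
    durations[stage] = now() - startedAt;
    diagnostics.stageDurationsMs = durations;
  };
}
```

In the orchestrator, the returned function would be called right after the awaited stage resolves, for example `const stop = startStageTimer(diagnostics, 'translation'); ...; stop();`.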
Runtime Architecture
New server modules
Add stage-focused modules under a new folder:
- src/server/subtitleStages/transcriptionStage.ts
- src/server/subtitleStages/segmentationStage.ts
- src/server/subtitleStages/translationStage.ts
- src/server/subtitleStages/voiceMatchingStage.ts
- src/server/subtitleStages/validationStage.ts
Add one orchestrator:
src/server/multiStageSubtitleGeneration.ts
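The orchestrator could be little more than a sequential chain of the five stages. The sketch below injects each stage as an async function so it stays independent of Doubao or Gemini. The names `StageFns` and `runSubtitlePipeline`, the `videoRef` parameter, and the generic type parameters are all illustrative, not the final module API.

```typescript
type StageLabel = 'transcribing' | 'segmenting' | 'translating' | 'matching_voice' | 'validating';

// Each stage only sees the output of the stage before it.
interface StageFns<Transcript, Segmented, Translated, Matched, Issue> {
  transcribe: (videoRef: string) => Promise<Transcript[]>;
  segment: (input: Transcript[]) => Promise<Segmented[]>;
  translate: (input: Segmented[]) => Promise<Translated[]>;
  matchVoices: (input: Translated[]) => Promise<Matched[]>;
  validate: (input: Matched[]) => Promise<Issue[]>;
}

async function runSubtitlePipeline<Transcript, Segmented, Translated, Matched, Issue>(
  videoRef: string,
  stages: StageFns<Transcript, Segmented, Translated, Matched, Issue>,
  onProgress: (label: StageLabel) => void = () => {},
): Promise<{ subtitles: Matched[]; validationIssues: Issue[] }> {
  onProgress('transcribing');
  const transcript = await stages.transcribe(videoRef);
  if (transcript.length === 0) {
    // Listed as a hard failure in this design.
    throw new Error('no subtitles returned after transcription');
  }

  onProgress('segmenting');
  const segmented = await stages.segment(transcript);

  onProgress('translating');
  const translated = await stages.translate(segmented);

  onProgress('matching_voice');
  const subtitles = await stages.matchVoices(translated);

  onProgress('validating');
  const validationIssues = await stages.validate(subtitles);

  return { subtitles, validationIssues };
}
```

Keeping the stages as injected functions also makes the later swap of Stage 1 to a dedicated ASR service a change at the call site only.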
Existing modules to adapt
- src/server/subtitleGeneration.ts
  - stop calling one all-in-one generator
  - call the new orchestrator instead
- src/server/videoSubtitleGeneration.ts
  - shrink into a stage-specific transcription helper or split its reusable provider code into lower-level helpers
- src/server/subtitlePipeline.ts
  - reuse normalization logic in Stage 2 and final payload assembly
- src/types.ts
  - add stage-specific types and validation metadata
- server.ts
  - optionally expose finer async job progress messages
Job Progress
The async job API should keep the same public contract but expose more precise stage labels:
- transcribing
- segmenting
- translating
- matching_voice
- validating
This can be done either by extending SubtitleJobStage or mapping internal stage names back to the existing progress system.
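The mapping option might look like the sketch below. The values of `LegacyJobStage` are placeholders standing in for the existing SubtitleJobStage union, which is not shown in this document; the messages are likewise illustrative.

```typescript
type InternalStage = 'transcribing' | 'segmenting' | 'translating' | 'matching_voice' | 'validating';

// Placeholder for the existing SubtitleJobStage values (assumed, not the real union).
type LegacyJobStage = 'analyzing' | 'generating' | 'finalizing';

const LEGACY_STAGE_BY_INTERNAL: Record<InternalStage, LegacyJobStage> = {
  transcribing: 'analyzing',
  segmenting: 'analyzing',
  translating: 'generating',
  matching_voice: 'generating',
  validating: 'finalizing',
};

const PROGRESS_MESSAGE_BY_INTERNAL: Record<InternalStage, string> = {
  transcribing: 'Transcribing dialogue',
  segmenting: 'Splitting subtitles',
  translating: 'Translating subtitles',
  matching_voice: 'Matching voices',
  validating: 'Validating result',
};

// Keeps the public stage value coarse while the message carries the finer label.
function toLegacyProgress(stage: InternalStage): { stage: LegacyJobStage; message: string } {
  return {
    stage: LEGACY_STAGE_BY_INTERNAL[stage],
    message: PROGRESS_MESSAGE_BY_INTERNAL[stage],
  };
}
```

Using `Record<InternalStage, ...>` means the compiler flags any internal stage that has no mapping yet.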
Error Handling
Hard failures
- unreadable video input
- provider request failure
- invalid stage output schema
- no subtitles returned after transcription
- no valid voiceId in Stage 4
These should fail the async job.
Soft warnings
- low transcript confidence
- suspected mistranscription
- missing speaker gender
- fallback default voice applied
These should be attached to diagnostics and surfaced later, but should not block the result.
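The split between hard failures and soft warnings can be reduced to a single decision at the end of the job. This sketch assumes both kinds are represented as issues with a severity, as in ValidationIssue; `JobOutcome` and `decideJobOutcome` are hypothetical names.

```typescript
// Minimal issue shape; mirrors the severity field of ValidationIssue.
interface PipelineIssue {
  subtitleId: string;
  code: string;
  message: string;
  severity: 'warning' | 'error';
}

type JobOutcome =
  | { status: 'failed'; errors: PipelineIssue[] }
  | { status: 'succeeded'; warnings: PipelineIssue[] };

// Errors fail the async job; warnings ride along in diagnostics without blocking.
function decideJobOutcome(issues: readonly PipelineIssue[]): JobOutcome {
  const errors = issues.filter((issue) => issue.severity === 'error');
  if (errors.length > 0) {
    return { status: 'failed', errors };
  }
  const warnings = issues.filter((issue) => issue.severity === 'warning');
  return { status: 'succeeded', warnings };
}
```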
Testing
Unit tests
- Stage 1 prompt and parsing tests
- Stage 2 segmentation behavior tests
- Stage 3 translation contract tests
- Stage 4 voice catalog matching tests
- Stage 5 validation issue generation tests
Integration tests
- full orchestration success path
- transcription low-confidence warning path
- translation failure path
- voice mismatch validation path
- async job progress updates across all five stages
Rollout Strategy
Phase 1
- Add stage contracts and orchestrator behind the existing /generate-subtitles entry point.
- Keep the final API shape stable.
Phase 2
- Store validation issues in the result payload.
- Surface review warnings in the editor.
Phase 3
- Optionally swap Stage 1 to a stronger dedicated ASR service without changing translation, voice matching, or editor code.
Recommendation
Implement the 4+1 pipeline in the backend first, while keeping the current frontend contract stable. That gives immediate gains in debuggability and transcription discipline, and it creates a clean seam for future ASR upgrades.