video_translate/docs/plans/2026-03-19-4-plus-1-subtitle-pipeline-design.md
Song367 · 85065cbca3 · Build multi-stage subtitle and dubbing pipeline · 2026-03-20 20:55:40 +08:00

# 4+1 Subtitle Pipeline Design
## Goal
Replace the current one-shot subtitle generation flow with a staged `4+1` pipeline so transcription fidelity, translation quality, and voice selection can be improved independently.
## Scope
- Redesign the backend subtitle generation pipeline around five explicit stages.
- Keep the current upload flow, async job flow, and editor entry points intact for the first implementation pass.
- Preserve the current final payload shape for the editor, while adding richer intermediate metadata for debugging and review.
- Keep Doubao and Gemini provider support, but stop asking one model call to do transcription, translation, and voice matching in a single response.
## Non-Goals
- Do not add a new human review UI in this step.
- Do not replace the current editor or dubbing UI.
- Do not require a new third-party ASR vendor for the first pass. Stage 1 can still use the current multimodal provider if its prompt and output contract are narrowed to transcription only.
## Problem Summary
The current pipeline asks a single provider call to:
- watch and listen to the video
- transcribe dialogue
- split subtitle segments
- translate to English
- translate to the TTS language
- infer speaker metadata
- select a voice id
This creates two major problems:
1. When the model mishears the original dialogue, every downstream field is wrong.
2. It is hard to tell whether a bad result came from transcription, segmentation, translation, or voice matching.
The new design fixes that by isolating each responsibility.
## Design
### Stage overview
The new pipeline contains four production stages plus one validation stage:
1. `Stage 1: Transcription`
2. `Stage 2: Segmentation`
3. `Stage 3: Translation`
4. `Stage 4: Voice Matching`
5. `Stage 5: Validation`
Each stage receives a narrow input contract and returns a narrow output contract. Later stages must never invent or overwrite core facts from earlier stages.
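The narrow-contract rule can be made explicit in code. The following is a minimal sketch only; `Stage` and `chain` are illustrative names, not the final API:

```ts
// Illustrative only: each stage is an async function with a narrow input/output contract.
type Stage<I, O> = (input: I, ctx: { requestId: string }) => Promise<O>;

// Stages compose left to right; a later stage only sees the previous stage's
// output, so it cannot overwrite facts it never received.
function chain<A, B, C>(first: Stage<A, B>, second: Stage<B, C>): Stage<A, C> {
  return async (input, ctx) => second(await first(input, ctx), ctx);
}
```

Because the composition is typed, a stage that tries to emit fields outside its contract fails at compile time rather than at review time.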
### Stage 1: Transcription
**Purpose**
Extract the source dialogue from the video as faithfully as possible.
**Input**
- local video path or remote `fileId`
- provider configuration
- request id
**Output**
```ts
interface TranscriptSegment {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
  speaker?: string;
  gender?: 'male' | 'female' | 'unknown';
  confidence?: number;
  needsReview?: boolean;
}
```
**Rules**
- Only transcribe audible dialogue.
- Do not translate.
- Do not rewrite or polish.
- Prefer conservative output when unclear.
- Mark low-confidence segments with `needsReview`.
**Notes**
For the first pass, this stage can still call the current multimodal provider, but with a transcription-only prompt and schema. That gives the pipeline separation immediately without forcing a provider migration on day one.
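Narrowing the provider's output contract mostly means strict parsing on our side. A sketch of that boundary, where the review threshold and the default fallbacks are assumptions to tune, not decided values:

```ts
interface TranscriptSegment {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
  confidence?: number;
  needsReview?: boolean;
}

const REVIEW_THRESHOLD = 0.7; // assumed cutoff; tune against real transcripts

// Reject anything that is not a plain transcript row, and flag low-confidence
// rows instead of letting the model "repair" them downstream.
function parseTranscriptResponse(raw: unknown): TranscriptSegment[] {
  if (!Array.isArray(raw)) throw new Error("transcription output must be an array");
  return raw.map((row, i) => {
    const r = row as Partial<TranscriptSegment>;
    if (
      typeof r.originalText !== "string" ||
      typeof r.startTime !== "number" ||
      typeof r.endTime !== "number" ||
      r.endTime <= r.startTime
    ) {
      throw new Error(`invalid transcript segment at index ${i}`);
    }
    return {
      id: r.id ?? `seg-${i}`,
      startTime: r.startTime,
      endTime: r.endTime,
      originalText: r.originalText,
      speakerId: r.speakerId ?? "unknown",
      confidence: r.confidence,
      needsReview: r.confidence !== undefined && r.confidence < REVIEW_THRESHOLD,
    };
  });
}
```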
### Stage 2: Segmentation
**Purpose**
Turn raw transcript segments into subtitle-friendly chunks without changing meaning.
**Input**
- `TranscriptSegment[]`
**Output**
```ts
interface SegmentedSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
  speaker?: string;
  gender?: 'male' | 'female' | 'unknown';
  confidence?: number;
  needsReview?: boolean;
}
```
**Rules**
- May split or merge segments for readability.
- Must not paraphrase `originalText`.
- Must preserve chronological order and non-overlap.
- Should reuse existing normalization and sentence reconstruction helpers where possible.
**Notes**
This stage should absorb logic that is currently mixed between [subtitlePipeline.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitlePipeline.ts) and provider prompts.
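To make the "no paraphrasing" rule concrete, here is a minimal merge-only sketch; the duration and gap thresholds are assumed values, and splitting long segments (which needs timing interpolation) is left out:

```ts
interface SegmentedSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
}

const MIN_DURATION_S = 1.0; // assumed readability floor
const MAX_GAP_S = 0.3;      // assumed max silence to merge across

// Merge very short segments into the previous one when the speaker matches and
// the gap is small. Text is concatenated verbatim, never paraphrased.
function mergeShortSegments(segments: SegmentedSubtitle[]): SegmentedSubtitle[] {
  const out: SegmentedSubtitle[] = [];
  for (const seg of segments) {
    const prev = out[out.length - 1];
    const duration = seg.endTime - seg.startTime;
    if (
      prev &&
      duration < MIN_DURATION_S &&
      prev.speakerId === seg.speakerId &&
      seg.startTime - prev.endTime <= MAX_GAP_S
    ) {
      prev.endTime = seg.endTime;
      prev.originalText = `${prev.originalText} ${seg.originalText}`;
    } else {
      out.push({ ...seg });
    }
  }
  return out;
}
```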
### Stage 3: Translation
**Purpose**
Translate already-confirmed source dialogue into display subtitles and dubbing text.
**Input**
- `SegmentedSubtitle[]`
- subtitle language settings
- TTS language
**Output**
```ts
interface TranslatedSubtitle extends SegmentedSubtitle {
  translatedText: string;
  ttsText: string;
  ttsLanguage: string;
}
```
**Rules**
- `translatedText` is always English for on-screen subtitles.
- `ttsText` is always the requested TTS language.
- Translation must derive from `originalText` only.
- This stage must not edit timestamps or speaker identity.
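The timestamp and speaker rules can be enforced by construction rather than by review: the stage spreads every upstream field untouched and only adds new ones. A sketch, where `TranslateFn` is a hypothetical injected provider hook, not an existing function:

```ts
interface SegmentedSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  speakerId: string;
}

interface TranslatedSubtitle extends SegmentedSubtitle {
  translatedText: string;
  ttsText: string;
  ttsLanguage: string;
}

// Hypothetical provider hook: translates one source line into one target language.
type TranslateFn = (text: string, targetLang: string) => Promise<string>;

// Copies every upstream field as-is and only adds translation fields, so
// timestamps and speaker identity cannot drift in this stage.
async function translateStage(
  segments: SegmentedSubtitle[],
  ttsLanguage: string,
  translate: TranslateFn,
): Promise<TranslatedSubtitle[]> {
  return Promise.all(
    segments.map(async (seg) => ({
      ...seg,
      translatedText: await translate(seg.originalText, "en"),
      ttsText: await translate(seg.originalText, ttsLanguage),
      ttsLanguage,
    })),
  );
}
```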
### Stage 4: Voice Matching
**Purpose**
Assign the most suitable `voiceId` to each subtitle segment.
**Input**
- `TranslatedSubtitle[]`
- available voices for `ttsLanguage`
**Output**
```ts
interface VoiceMatchedSubtitle extends TranslatedSubtitle {
  voiceId: string;
}
```
**Rules**
- Only select from the provided voice catalog.
- Use `speaker`, `gender`, and tone hints when available.
- Must not change transcript or translation fields.
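A deterministic baseline for this stage needs no model call at all: filter the catalog by gender when known, then rotate through the pool so distinct speakers get distinct voices. A sketch under those assumptions; the real stage may layer tone hints on top:

```ts
interface VoiceOption {
  voiceId: string;
  gender: "male" | "female" | "unknown";
}

interface Matchable {
  speakerId: string;
  gender?: "male" | "female" | "unknown";
}

// Assign one stable voice per speaker, never touching other fields.
function matchVoices<T extends Matchable>(
  segments: T[],
  catalog: VoiceOption[],
): (T & { voiceId: string })[] {
  if (catalog.length === 0) throw new Error("empty voice catalog");
  const assigned = new Map<string, string>();
  let cursor = 0;
  return segments.map((seg) => {
    let voiceId = assigned.get(seg.speakerId);
    if (!voiceId) {
      const pool = catalog.filter(
        (v) => !seg.gender || seg.gender === "unknown" || v.gender === seg.gender,
      );
      const pick = pool.length > 0 ? pool : catalog; // fallback: any catalog voice
      voiceId = pick[cursor++ % pick.length].voiceId;
      assigned.set(seg.speakerId, voiceId);
    }
    return { ...seg, voiceId };
  });
}
```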
### Stage 5: Validation
**Purpose**
Check internal consistency before returning the final result to the editor.
**Input**
- `VoiceMatchedSubtitle[]`
**Output**
```ts
interface ValidationIssue {
  subtitleId: string;
  code:
    | 'low_confidence_transcript'
    | 'timing_overlap'
    | 'missing_tts_text'
    | 'voice_language_mismatch'
    | 'empty_translation';
  message: string;
  severity: 'warning' | 'error';
}
```
**Rules**
- Do not rewrite content in this stage.
- Report warnings and errors separately.
- Only block final success on true contract failures, not soft quality warnings.
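A minimal read-only pass covering three of the issue codes might look like this; the severity assignments here are assumptions to confirm (e.g. whether an empty translation should block or only warn):

```ts
interface ValidationIssue {
  subtitleId: string;
  code: "timing_overlap" | "missing_tts_text" | "empty_translation";
  message: string;
  severity: "warning" | "error";
}

interface FinalSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  translatedText: string;
  ttsText: string;
}

// Reports issues but never mutates subtitles; the orchestrator decides what blocks.
function validate(subtitles: FinalSubtitle[]): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  subtitles.forEach((sub, i) => {
    const prev = subtitles[i - 1];
    if (prev && sub.startTime < prev.endTime) {
      issues.push({
        subtitleId: sub.id,
        code: "timing_overlap",
        message: `overlaps previous subtitle ${prev.id}`,
        severity: "error",
      });
    }
    if (sub.ttsText.trim() === "") {
      issues.push({
        subtitleId: sub.id,
        code: "missing_tts_text",
        message: "tts text is empty",
        severity: "error",
      });
    }
    if (sub.translatedText.trim() === "") {
      issues.push({
        subtitleId: sub.id,
        code: "empty_translation",
        message: "translated text is empty",
        severity: "warning", // assumed soft; may need to be an error
      });
    }
  });
  return issues;
}
```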
## Data Model
### Final subtitle shape
The editor should continue to receive `SubtitlePipelineResult`, but the result will now be built from staged outputs rather than a single provider response.
Existing final subtitle fields remain:
- `id`
- `startTime`
- `endTime`
- `originalText`
- `translatedText`
- `ttsText`
- `ttsLanguage`
- `speaker`
- `speakerId`
- `confidence`
- `voiceId`
### New metadata
Add optional pipeline metadata to the final result:
```ts
interface SubtitlePipelineDiagnostics {
  validationIssues?: ValidationIssue[];
  stageDurationsMs?: Partial<
    Record<
      'transcription' | 'segmentation' | 'translation' | 'voiceMatching' | 'validation',
      number
    >
  >;
}
```
This metadata is primarily for logging, debugging, and future review UI. The editor does not need to block on it.
## Runtime Architecture
### New server modules
Add stage-focused modules under a new folder:
- `src/server/subtitleStages/transcriptionStage.ts`
- `src/server/subtitleStages/segmentationStage.ts`
- `src/server/subtitleStages/translationStage.ts`
- `src/server/subtitleStages/voiceMatchingStage.ts`
- `src/server/subtitleStages/validationStage.ts`
Add one orchestrator:
- `src/server/multiStageSubtitleGeneration.ts`
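The orchestrator's job is sequencing plus bookkeeping. One way to populate `stageDurationsMs` is a small timing wrapper around each stage call; `runStage` and the stage signature below are sketch-level assumptions, not the final API:

```ts
type StageName =
  | "transcription"
  | "segmentation"
  | "translation"
  | "voiceMatching"
  | "validation";

// Run one stage, recording its wall-clock duration into the diagnostics map.
// Duration is recorded even when the stage throws, so failed runs still show
// where time went.
async function runStage<I, O>(
  name: StageName,
  durations: Partial<Record<StageName, number>>,
  stage: (input: I) => Promise<O>,
  input: I,
): Promise<O> {
  const started = Date.now();
  try {
    return await stage(input);
  } finally {
    durations[name] = Date.now() - started;
  }
}
```

The orchestrator would call this five times in sequence, threading each stage's output into the next and returning the durations map alongside the final payload.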
### Existing modules to adapt
- [src/server/subtitleGeneration.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitleGeneration.ts)
  - stop calling the one-shot all-in-one generator
  - call the new orchestrator instead
- [src/server/videoSubtitleGeneration.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/videoSubtitleGeneration.ts)
  - shrink into a stage-specific transcription helper, or split its reusable provider code into lower-level helpers
- [src/server/subtitlePipeline.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/server/subtitlePipeline.ts)
  - reuse its normalization logic in Stage 2 and in final payload assembly
- [src/types.ts](/E:/Downloads/ai-video-dubbing-&-translation/src/types.ts)
  - add stage-specific types and validation metadata
- [server.ts](/E:/Downloads/ai-video-dubbing-&-translation/server.ts)
  - optionally expose finer-grained async job progress messages
## Job Progress
The async job API should keep the same public contract but expose more precise stage labels:
- `transcribing`
- `segmenting`
- `translating`
- `matching_voice`
- `validating`
This can be done either by extending `SubtitleJobStage` or mapping internal stage names back to the existing progress system.
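If `SubtitleJobStage` is not extended, the mapping option is a plain lookup table. The public label names below are placeholders, not the actual `SubtitleJobStage` values:

```ts
// Assumed existing public labels; replace with the real SubtitleJobStage values.
type PublicStage = "processing" | "finalizing";

type InternalStage =
  | "transcribing"
  | "segmenting"
  | "translating"
  | "matching_voice"
  | "validating";

// Collapse the five internal stages onto the existing coarse-grained labels so
// current clients keep working while logs retain the internal stage name.
const STAGE_LABELS: Record<InternalStage, PublicStage> = {
  transcribing: "processing",
  segmenting: "processing",
  translating: "processing",
  matching_voice: "processing",
  validating: "finalizing",
};
```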
## Error Handling
### Hard failures
- unreadable video input
- provider request failure
- invalid stage output schema
- no subtitles returned after transcription
- no valid `voiceId` in Stage 4
These should fail the async job.
### Soft warnings
- low transcript confidence
- suspected mistranscription
- missing speaker gender
- fallback default voice applied
These should be attached to diagnostics and surfaced later, but should not block the result.
## Testing
### Unit tests
- Stage 1 prompt and parsing tests
- Stage 2 segmentation behavior tests
- Stage 3 translation contract tests
- Stage 4 voice catalog matching tests
- Stage 5 validation issue generation tests
### Integration tests
- full orchestration success path
- transcription low-confidence warning path
- translation failure path
- voice mismatch validation path
- async job progress updates across all five stages
## Rollout Strategy
### Phase 1
- Add stage contracts and orchestrator behind the existing `/generate-subtitles` entry point.
- Keep the final API shape stable.
### Phase 2
- Store validation issues in the result payload.
- Surface review warnings in the editor.
### Phase 3
- Optionally swap Stage 1 to a stronger dedicated ASR service without changing translation, voice matching, or editor code.
## Recommendation
Implement the 4+1 pipeline in the backend first, while keeping the current frontend contract stable. That gives immediate gains in debuggability and transcription discipline, and it creates a clean seam for future ASR upgrades.