Precise Dialogue Localization Design
Date: 2026-03-17
Goal: Upgrade the subtitle pipeline so sentence boundaries are more accurate, word-level timings are available, and speaker attribution is based on audio rather than LLM guesses.
Current State
The current implementation has two subtitle generation paths:
- The primary path in server.ts extracts audio, calls Whisper with timestamp_granularities: ['segment'], then asks an LLM to translate and infer speaker and gender.
- The fallback path in src/services/geminiService.ts uses Gemini to infer subtitles from video or sampled frames.
This is enough for rough subtitle generation, but it has three hard limits:
- Sentence timing is only segment-level, so start and end times drift at pause boundaries.
- Word-level timestamps do not exist, so precise editing and karaoke-style highlighting are impossible.
- Speaker identity is inferred from text, not measured from audio, so diarization quality is unreliable.
Chosen Approach
Adopt a high-precision pipeline with a dedicated alignment layer:
- Extract clean mono audio from the uploaded video.
- Use voice activity detection (VAD) to isolate speech regions.
- Run ASR for rough transcription.
- Run forced alignment to refine every word boundary against the audio.
- Run speaker diarization to assign stable speakerId values.
- Rebuild editable subtitle sentences from aligned words.
- Translate only the sentence text while preserving timestamps and speaker assignments.
The existing Node service remains the entry point, but it becomes an orchestration layer instead of doing all timing work itself.
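The orchestration contract can be sketched as a dependency-injected pipeline, which also makes the sequencing testable without real media. All names here are illustrative, not the actual server.ts API:

```typescript
// Sketch of the orchestration contract: all timing work happens in the
// alignment layer; the Node service only sequences the stages.
type Stage<I, O> = (input: I) => Promise<O>;

interface PipelineDeps {
  extractAudio: Stage<string, string>;            // video path -> 16 kHz mono WAV path
  align: Stage<string, { subtitles: object[] }>;  // WAV path -> aligned subtitle objects
  translate: Stage<object[], object[]>;           // rewrites sentence text only; timings untouched
}

async function runPipeline(videoPath: string, deps: PipelineDeps): Promise<object[]> {
  const wavPath = await deps.extractAudio(videoPath);
  const { subtitles } = await deps.align(wavPath);
  return deps.translate(subtitles);
}
```

Because the stages are injected, unit tests can pass async stubs and verify that translation never sees or modifies timing fields.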
Architecture
Frontend
The React editor continues to call /api/process-audio-pipeline, but it now receives richer subtitle objects:
- Sentence-level timing for the timeline.
- Word-level timing for precise playback feedback.
- Stable speakerId values for speaker-aware UI and voice assignment.
The current editor can remain backward compatible by continuing to render sentence-level fields first and gradually enabling word-level behavior.
Node Orchestration Layer
server.ts keeps responsibility for:
- Receiving uploaded video data.
- Extracting audio with FFmpeg.
- Calling the alignment service.
- Translating sentence text.
- Returning a normalized payload to the frontend.
The Node layer must not allow translation to rewrite timing or speaker assignments.
Alignment Layer
This layer owns all timing-critical operations:
- VAD
- ASR
- Forced alignment
- Speaker diarization
It can be implemented as a local Python service or a separately managed service as long as it returns deterministic machine-readable JSON.
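One way to keep that boundary honest is a single adapter call plus a runtime validator for the JSON contract. The field names below mirror the data model in this document, but the exact wire format is an assumption:

```typescript
// Hypothetical wire format returned by the alignment layer.
interface AlignmentWord {
  text: string;
  startTime: number;   // seconds, from forced alignment
  endTime: number;
  speakerId: string;   // e.g. "spk_0", from diarization
  confidence: number;  // 0..1
}

interface AlignmentResponse {
  words: AlignmentWord[];
  speakers: { speakerId: string; label: string }[];
  engine: string;      // e.g. "whisperx+pyannote"
}

// Runtime guard: the Node layer should reject malformed responses early
// rather than passing unchecked JSON to sentence reconstruction.
function isAlignmentResponse(x: any): x is AlignmentResponse {
  return Array.isArray(x?.words) && Array.isArray(x?.speakers) && typeof x?.engine === 'string';
}

// Adapter boundary: one call, regardless of how the service is hosted.
async function callAlignmentService(wavPath: string, baseUrl: string): Promise<AlignmentResponse> {
  const res = await fetch(`${baseUrl}/align`, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ audioPath: wavPath }),
  });
  if (!res.ok) throw new Error(`alignment service failed: ${res.status}`);
  const body = await res.json();
  if (!isAlignmentResponse(body)) throw new Error('malformed alignment response');
  return body;
}
```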
Data Model
The current Subtitle type should be extended rather than replaced.
type WordTiming = {
text: string;
startTime: number;
endTime: number;
speakerId: string;
confidence: number;
};
type Subtitle = {
id: string;
startTime: number;
endTime: number;
originalText: string;
translatedText: string;
speaker: string;
speakerId: string;
voiceId: string;
words: WordTiming[];
confidence: number;
audioUrl?: string;
volume?: number;
};
type SpeakerTrack = {
speakerId: string;
label: string;
gender?: 'male' | 'female' | 'unknown';
};
Rules:
- speakerId is the stable machine identifier, for example spk_0.
- speaker is a user-facing label and can be renamed.
- Sentence startTime and endTime are derived from the first and last aligned words.
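The derivation rule for sentence timing is small enough to state directly; a minimal sketch:

```typescript
// Sentence timing is derived, never stored independently: start comes from
// the first aligned word, end from the last.
type WordTiming = {
  text: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  confidence: number;
};

function sentenceSpan(words: WordTiming[]): { startTime: number; endTime: number } {
  if (words.length === 0) throw new Error('a sentence must contain at least one aligned word');
  return { startTime: words[0].startTime, endTime: words[words.length - 1].endTime };
}
```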
Processing Rules
Audio Preparation
- Convert uploaded video to 16 kHz mono WAV.
- Optionally create a denoised or vocal-enhanced copy when the source contains heavy music.
VAD
Use VAD to identify speech windows and pad each detected region by about 0.2s.
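Padding adjacent regions can make them overlap, so the post-processing step should also merge them. A sketch using the 0.2 s value above (the merge behavior is an assumption, not specified in this document):

```typescript
// Pad each detected speech region by `pad` seconds on both sides, clamp to
// the clip bounds, then merge regions that overlap after padding.
type Region = { start: number; end: number };

function padAndMerge(regions: Region[], pad = 0.2, clipEnd = Infinity): Region[] {
  const padded = regions
    .map((r) => ({ start: Math.max(0, r.start - pad), end: Math.min(clipEnd, r.end + pad) }))
    .sort((a, b) => a.start - b.start);
  const merged: Region[] = [];
  for (const r of padded) {
    const last = merged[merged.length - 1];
    if (last && r.start <= last.end) last.end = Math.max(last.end, r.end);
    else merged.push({ ...r });
  }
  return merged;
}
```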
ASR and Forced Alignment
- Use ASR for text hypotheses and rough word order.
- Use forced alignment to compute accurate startTime and endTime for each word.
- Treat forced alignment as the source of truth for timing whenever available.
Diarization
- Run diarization separately and produce speaker segments.
- Assign each word to the speaker with the highest overlap.
- If a sentence crosses speakers, split it rather than forcing a mixed-speaker sentence.
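The highest-overlap rule is a straightforward interval computation; a minimal sketch, with "unknown" as the fallback when no diarization segment overlaps the word at all:

```typescript
// Assign a word to the diarization speaker whose segment overlaps it most.
type Span = { start: number; end: number };
type SpeakerSegment = Span & { speakerId: string };

function overlap(a: Span, b: Span): number {
  return Math.max(0, Math.min(a.end, b.end) - Math.max(a.start, b.start));
}

function assignSpeaker(word: Span, segments: SpeakerSegment[]): string {
  let best = 'unknown';
  let bestOverlap = 0;
  for (const seg of segments) {
    const o = overlap(word, seg);
    if (o > bestOverlap) {
      bestOverlap = o;
      best = seg.speakerId;
    }
  }
  return best;
}
```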
Sentence Reconstruction
Build sentence subtitles from words using conservative rules:
- Keep words together only when speakerId is the same.
- Split when adjacent word gaps exceed 0.45s.
- Split when sentence duration would exceed 8s.
- Split on strong punctuation or long pauses.
- Avoid returning sentences shorter than 0.6s unless the source is actually brief.
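The splitting rules above compose into a single pass over the word stream. A sketch using the thresholds listed (the punctuation set is an assumption):

```typescript
// Conservative sentence reconstruction: start a new sentence on a speaker
// change, a long inter-word gap, an over-long sentence, or strong punctuation.
type Word = { text: string; startTime: number; endTime: number; speakerId: string };

const MAX_GAP = 0.45;       // seconds between adjacent words
const MAX_DURATION = 8;     // seconds per sentence
const STRONG_PUNCT = /[.!?。！？]$/; // assumed punctuation set

function splitIntoSentences(words: Word[]): Word[][] {
  const sentences: Word[][] = [];
  let current: Word[] = [];
  for (const w of words) {
    const prev = current[current.length - 1];
    const mustSplit =
      prev !== undefined &&
      (w.speakerId !== prev.speakerId ||
        w.startTime - prev.endTime > MAX_GAP ||
        w.endTime - current[0].startTime > MAX_DURATION ||
        STRONG_PUNCT.test(prev.text));
    if (mustSplit) {
      sentences.push(current);
      current = [];
    }
    current.push(w);
  }
  if (current.length > 0) sentences.push(current);
  return sentences;
}
```

The minimum-duration rule (0.6 s) is better handled afterward as a merge pass, so it does not fight the split rules inside the loop.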
API Design
Reuse /api/process-audio-pipeline, but upgrade its payload to:
{
"subtitles": [],
"speakers": [],
"sourceLanguage": "zh",
"targetLanguage": "en",
"duration": 123.45,
"quality": "full",
"alignmentEngine": "whisperx+pyannote"
}
Quality levels:
- full: sentence timings, word timings, and diarization are all available.
- partial: word timings are available but diarization is missing or unreliable.
- fallback: high-precision alignment failed, so the app returned rough timing from the existing path.
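The quality level follows mechanically from what the alignment layer actually produced; a minimal sketch:

```typescript
// Derive the payload quality level from the capabilities that survived
// the pipeline run.
type Quality = 'full' | 'partial' | 'fallback';

function deriveQuality(opts: { hasWordTimings: boolean; hasDiarization: boolean }): Quality {
  if (!opts.hasWordTimings) return 'fallback';
  return opts.hasDiarization ? 'full' : 'partial';
}
```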
Frontend Behavior
The current editor in src/components/EditorScreen.tsx should evolve incrementally:
- Keep the existing sentence-based timeline as the default view.
- Add word-level highlighting during playback when words exist.
- Add speaker-aware styling and filtering when speakers exist.
- Preserve manual timeline editing and snap dragged sentence edges to nearest word boundaries when possible.
Fallback behavior:
- If quality is full, enable all precision UI.
- If quality is partial, disable speaker-specific UI and keep timing features.
- If quality is fallback, continue with the current editor and show a low-precision notice.
Error Handling and Degradation
The product must remain usable even when the high-precision path is incomplete.
- If forced alignment fails, return sentence-level ASR output instead of failing the whole request.
- If diarization fails, keep timings and mark speakerId as unknown.
- If translation fails, return original text with timings intact.
- If the alignment layer is unavailable, fall back to the existing visual pipeline and set quality: "fallback".
- Preserve low-confidence words and expose their confidence rather than dropping them silently.
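One way to implement this policy uniformly is to wrap each optional stage so a failure downgrades the result instead of aborting the request; a sketch, with illustrative names:

```typescript
// Run an optional pipeline stage; on failure, substitute a fallback value
// and report that the stage degraded rather than throwing.
async function tryStage<T>(stage: () => Promise<T>, fallback: T): Promise<{ value: T; ok: boolean }> {
  try {
    return { value: await stage(), ok: true };
  } catch {
    return { value: fallback, ok: false };
  }
}
```

The orchestration layer can then collect the `ok` flags from diarization, translation, and alignment to compute the final quality level.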
Testing Strategy
Coverage should focus on deterministic logic:
- Sentence reconstruction from aligned words.
- Speaker assignment from overlapping diarization segments.
- API normalization and fallback handling.
- Frontend word-highlighting and snapping helpers.
End-to-end manual verification should include:
- Single-speaker clip with pauses.
- Two-speaker dialogue with interruptions.
- Music-heavy clip.
- Alignment failure fallback.
Rollout Plan
- Extend types and response normalization first.
- Introduce the alignment adapter behind a feature flag or environment guard.
- Return richer payloads while keeping the current UI backward compatible.
- Add word-level highlighting and speaker-aware UI after the backend contract stabilizes.
Constraints and Notes
- This workspace is not a Git repository, so the required design-document commit could not be performed here.
- The current project does not yet include a test runner, so the implementation plan includes test infrastructure setup before feature work.
Implementation Status
Implemented in this workspace:
- Test infrastructure using Vitest, jsdom, and Testing Library.
- Shared subtitle pipeline helpers for normalization, sentence reconstruction, speaker assignment, word highlighting, and timeline snapping.
- A backend subtitle orchestration layer plus an alignment-service adapter boundary for local ASR / alignment backends.
- Gemini-based sentence translation in the audio pipeline, without relying on OpenAI for ASR or translation.
- Frontend pipeline mapping, precision notices, word-level playback feedback, and speaker-aware presentation.
Automated verification completed:
- npm test -- --run
- npm run lint
- npm run build
Manual verification still pending:
- Single-speaker clip with pauses.
- Two-speaker dialogue with interruptions.
- Music-heavy clip.
- Alignment-service unavailable fallback using a real upload.