video_translate/docs/plans/2026-03-17-precise-dialogue-localization-design.md

Precise Dialogue Localization Design

Date: 2026-03-17

Goal: Upgrade the subtitle pipeline so sentence boundaries are more accurate, word-level timings are available, and speaker attribution is based on audio rather than LLM guesses.

Current State

The current implementation has two subtitle generation paths:

  1. The primary path in server.ts extracts audio, calls Whisper with timestamp_granularities: ['segment'], then asks an LLM to translate and infer speaker and gender.
  2. The fallback path in src/services/geminiService.ts uses Gemini to infer subtitles from video or sampled frames.

This is enough for rough subtitle generation, but it has three hard limits:

  1. Sentence timing is only segment-level, so start and end times drift at pause boundaries.
  2. Word-level timestamps do not exist, so precise editing and karaoke-style highlighting are impossible.
  3. Speaker identity is inferred from text, not measured from audio, so diarization quality is unreliable.

Chosen Approach

Adopt a high-precision pipeline with a dedicated alignment layer:

  1. Extract clean mono audio from the uploaded video.
  2. Use voice activity detection (VAD) to isolate speech regions.
  3. Run ASR for rough transcription.
  4. Run forced alignment to refine every word boundary against the audio.
  5. Run speaker diarization to assign stable speakerId values.
  6. Rebuild editable subtitle sentences from aligned words.
  7. Translate only the sentence text while preserving timestamps and speaker assignments.

The existing Node service remains the entry point, but it becomes an orchestration layer instead of doing all timing work itself.

Architecture

Frontend

The React editor continues to call /api/process-audio-pipeline, but it now receives richer subtitle objects:

  1. Sentence-level timing for the timeline.
  2. Word-level timing for precise playback feedback.
  3. Stable speakerId values for speaker-aware UI and voice assignment.

The current editor stays backward compatible: it continues to render sentence-level fields first and can enable word-level behavior gradually.
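As a sketch of the word-level playback feedback mentioned above, the frontend can look up the word under the playhead each frame. The helper name and shape below are illustrative, not part of the existing codebase:

```typescript
type TimedWord = { startTime: number; endTime: number };

// Find the word under the playhead for word-level highlighting; returns -1
// when no word covers the current time (e.g. during a pause between words).
function activeWordIndex(words: TimedWord[], currentTime: number): number {
  return words.findIndex(
    (w) => currentTime >= w.startTime && currentTime < w.endTime,
  );
}
```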

Node Orchestration Layer

server.ts keeps responsibility for:

  1. Receiving uploaded video data.
  2. Extracting audio with FFmpeg.
  3. Calling the alignment service.
  4. Translating sentence text.
  5. Returning a normalized payload to the frontend.

The Node layer must not allow translation to rewrite timing or speaker assignments.
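That invariant can be made structural: the translation step receives only text and merges it back without touching any other field. The helper name `applyTranslations` and the trimmed-down subtitle shape below are assumptions for illustration:

```typescript
// Trimmed-down subtitle shape for the sketch; the full Subtitle type lives
// in the Data Model section. applyTranslations is a hypothetical helper.
type AlignedSubtitle = {
  id: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  originalText: string;
  translatedText: string;
};

function applyTranslations(
  subtitles: AlignedSubtitle[],
  translations: Map<string, string>, // subtitle id -> translated text
): AlignedSubtitle[] {
  return subtitles.map((s) => ({
    ...s, // timing and speaker fields are copied verbatim
    translatedText: translations.get(s.id) ?? s.translatedText,
  }));
}
```

Because translation output is merged by id, a malformed or reordered LLM response can never shift timestamps or reassign speakers.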

Alignment Layer

This layer owns all timing-critical operations:

  1. VAD
  2. ASR
  3. Forced alignment
  4. Speaker diarization

It can be implemented as a local Python service or a separately managed service as long as it returns deterministic machine-readable JSON.
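One possible shape for that JSON, expressed as TypeScript types plus a minimal sanity check the Node layer could run before trusting a payload. The field names are assumptions, not a finalized contract:

```typescript
// Illustrative response shape for the alignment layer; field names are
// assumptions, not a fixed contract.
type AlignmentWord = {
  text: string;
  startTime: number; // seconds
  endTime: number;
  speakerId: string; // e.g. "spk_0"; "unknown" when diarization failed
  confidence: number; // 0..1
};

type AlignmentResult = {
  words: AlignmentWord[];
  speakers: { speakerId: string; label: string }[];
  engine: string; // e.g. "whisperx+pyannote"
};

// Deterministic output implies word times are well-formed and ordered;
// reject payloads that violate that before building sentences from them.
function isMonotonic(words: AlignmentWord[]): boolean {
  return words.every(
    (w, i) =>
      w.endTime >= w.startTime &&
      (i === 0 || w.startTime >= words[i - 1].startTime),
  );
}
```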

Data Model

The current Subtitle type should be extended rather than replaced.

```typescript
type WordTiming = {
  text: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  confidence: number;
};

type Subtitle = {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  translatedText: string;
  speaker: string;
  speakerId: string;
  voiceId: string;
  words: WordTiming[];
  confidence: number;
  audioUrl?: string;
  volume?: number;
};

type SpeakerTrack = {
  speakerId: string;
  label: string;
  gender?: 'male' | 'female' | 'unknown';
};
```

Rules:

  1. speakerId is the stable machine identifier, for example spk_0.
  2. speaker is a user-facing label and can be renamed.
  3. Sentence startTime and endTime are derived from the first and last aligned words.
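Rule 3 can be enforced with a small derivation helper rather than trusting stored sentence times. A minimal sketch (the `Word` shape mirrors WordTiming above):

```typescript
type Word = { text: string; startTime: number; endTime: number };

// Rule 3: a sentence's boundaries come from its first and last aligned words,
// so edited or re-grouped sentences can never drift from their audio.
function sentenceBounds(words: Word[]): { startTime: number; endTime: number } {
  if (words.length === 0) {
    throw new Error('a sentence needs at least one aligned word');
  }
  return {
    startTime: words[0].startTime,
    endTime: words[words.length - 1].endTime,
  };
}
```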

Processing Rules

Audio Preparation

  1. Convert uploaded video to 16kHz mono WAV.
  2. Optionally create a denoised or vocal-enhanced copy when the source contains heavy music.

VAD

Use VAD to identify speech windows and pad each detected region by about 0.2s.
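Padding can make adjacent regions overlap, so the padded windows should also be merged. A sketch of that step, assuming regions arrive as start/end pairs in seconds:

```typescript
type Region = { start: number; end: number };

// Pad each detected speech region by ~0.2s, clamp to [0, duration], and
// merge regions that overlap after padding.
function padRegions(regions: Region[], duration: number, pad = 0.2): Region[] {
  const padded = regions
    .map((r) => ({
      start: Math.max(0, r.start - pad),
      end: Math.min(duration, r.end + pad),
    }))
    .sort((a, b) => a.start - b.start);
  const merged: Region[] = [];
  for (const r of padded) {
    const last = merged[merged.length - 1];
    if (last && r.start <= last.end) {
      last.end = Math.max(last.end, r.end); // overlap: extend previous region
    } else {
      merged.push({ ...r });
    }
  }
  return merged;
}
```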

ASR and Forced Alignment

  1. Use ASR for text hypotheses and rough word order.
  2. Use forced alignment to compute accurate startTime and endTime for each word.
  3. Treat forced alignment as the source of truth for timing whenever available.

Diarization

  1. Run diarization separately and produce speaker segments.
  2. Assign each word to the speaker with the highest overlap.
  3. If a sentence crosses speakers, split it rather than forcing a mixed-speaker sentence.
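The highest-overlap assignment in step 2 is a few lines of interval arithmetic. A sketch, assuming diarization segments arrive as speaker-labeled time ranges:

```typescript
type DiarSegment = { speakerId: string; start: number; end: number };

// Assign a word to the diarization segment with the largest temporal overlap;
// words overlapping no segment get "unknown" (see error-handling rules).
function speakerForWord(
  wordStart: number,
  wordEnd: number,
  segments: DiarSegment[],
): string {
  let best = 'unknown';
  let bestOverlap = 0;
  for (const seg of segments) {
    const overlap =
      Math.min(wordEnd, seg.end) - Math.max(wordStart, seg.start);
    if (overlap > bestOverlap) {
      bestOverlap = overlap;
      best = seg.speakerId;
    }
  }
  return best;
}
```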

Sentence Reconstruction

Build sentence subtitles from words using conservative rules:

  1. Keep words together only when speakerId is the same.
  2. Split when adjacent word gaps exceed 0.45s.
  3. Split when sentence duration would exceed 8s.
  4. Split on strong punctuation or long pauses.
  5. Avoid returning sentences shorter than 0.6s unless the source is actually brief.
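The grouping core of these rules can be sketched as a single pass over the aligned words. This covers the speaker, gap, and duration rules; the punctuation and minimum-duration rules are omitted for brevity:

```typescript
type AlignedWord = {
  text: string;
  startTime: number;
  endTime: number;
  speakerId: string;
};

const MAX_GAP_S = 0.45; // rule 2
const MAX_SENTENCE_S = 8; // rule 3

// Group aligned words into sentence candidates using the speaker, gap, and
// duration rules; punctuation-based splitting would layer on top of this.
function groupWords(words: AlignedWord[]): AlignedWord[][] {
  const sentences: AlignedWord[][] = [];
  let current: AlignedWord[] = [];
  for (const w of words) {
    const prev = current[current.length - 1];
    const split =
      prev !== undefined &&
      (w.speakerId !== prev.speakerId ||
        w.startTime - prev.endTime > MAX_GAP_S ||
        w.endTime - current[0].startTime > MAX_SENTENCE_S);
    if (split) {
      sentences.push(current);
      current = [];
    }
    current.push(w);
  }
  if (current.length > 0) sentences.push(current);
  return sentences;
}
```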

API Design

Reuse /api/process-audio-pipeline, but upgrade its payload to:

```json
{
  "subtitles": [],
  "speakers": [],
  "sourceLanguage": "zh",
  "targetLanguage": "en",
  "duration": 123.45,
  "quality": "full",
  "alignmentEngine": "whisperx+pyannote"
}
```

Quality levels:

  1. full: sentence timings, word timings, and diarization are all available.
  2. partial: word timings are available but diarization is missing or unreliable.
  3. fallback: high-precision alignment failed, so the app returns rough timing from the existing path.
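The quality level should be derived from what the alignment layer actually returned, not declared up front. A sketch of that decision (the boolean inputs are assumptions about how the orchestrator tracks pipeline state):

```typescript
type Quality = 'full' | 'partial' | 'fallback';

// Derive the payload's quality level from what actually succeeded upstream.
function resolveQuality(
  hasWordTimings: boolean,
  hasDiarization: boolean,
): Quality {
  if (hasWordTimings && hasDiarization) return 'full';
  if (hasWordTimings) return 'partial';
  return 'fallback';
}
```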

Frontend Behavior

The current editor in src/components/EditorScreen.tsx should evolve incrementally:

  1. Keep the existing sentence-based timeline as the default view.
  2. Add word-level highlighting during playback when words exist.
  3. Add speaker-aware styling and filtering when speakers exist.
  4. Preserve manual timeline editing and snap dragged sentence edges to nearest word boundaries when possible.
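The snapping in point 4 only needs the word boundaries and a tolerance window. A sketch, where the 0.15s tolerance is an assumed tuning value, not a decided constant:

```typescript
// Snap a dragged sentence edge to the nearest word boundary within a
// tolerance window; return the raw time if no boundary is close enough,
// so free-form manual editing still works.
function snapToWordBoundary(
  time: number,
  boundaries: number[], // word start/end times in seconds
  toleranceS = 0.15, // assumed snap window
): number {
  let best = time;
  let bestDist = toleranceS;
  for (const b of boundaries) {
    const d = Math.abs(b - time);
    if (d <= bestDist) {
      bestDist = d;
      best = b;
    }
  }
  return best;
}
```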

Fallback behavior:

  1. If quality is full, enable all precision UI.
  2. If quality is partial, disable speaker-specific UI and keep timing features.
  3. If quality is fallback, continue with the current editor and show a low-precision notice.

Error Handling and Degradation

The product must remain usable even when the high-precision path is incomplete.

  1. If forced alignment fails, return sentence-level ASR output instead of failing the whole request.
  2. If diarization fails, keep timings and mark speakerId as unknown.
  3. If translation fails, return original text with timings intact.
  4. If the alignment layer is unavailable, fall back to the existing visual pipeline and set quality: "fallback".
  5. Preserve low-confidence words and expose their confidence rather than dropping them silently.

Testing Strategy

Coverage should focus on deterministic logic:

  1. Sentence reconstruction from aligned words.
  2. Speaker assignment from overlapping diarization segments.
  3. API normalization and fallback handling.
  4. Frontend word-highlighting and snapping helpers.

End-to-end manual verification should include:

  1. Single-speaker clip with pauses.
  2. Two-speaker dialogue with interruptions.
  3. Music-heavy clip.
  4. Alignment failure fallback.

Rollout Plan

  1. Extend types and response normalization first.
  2. Introduce the alignment adapter behind a feature flag or environment guard.
  3. Return richer payloads while keeping the current UI backward compatible.
  4. Add word-level highlighting and speaker-aware UI after the backend contract stabilizes.

Constraints and Notes

  1. This workspace is not a Git repository, so the required design-document commit could not be performed here.
  2. The current project does not yet include a test runner, so the implementation plan includes test infrastructure setup before feature work.

Implementation Status

Implemented in this workspace:

  1. Test infrastructure using Vitest, jsdom, and Testing Library.
  2. Shared subtitle pipeline helpers for normalization, sentence reconstruction, speaker assignment, word highlighting, and timeline snapping.
  3. A backend subtitle orchestration layer plus an alignment-service adapter boundary for local ASR / alignment backends.
  4. Gemini-based sentence translation in the audio pipeline, without relying on OpenAI for ASR or translation.
  5. Frontend pipeline mapping, precision notices, word-level playback feedback, and speaker-aware presentation.

Automated verification completed:

  1. npm test -- --run
  2. npm run lint
  3. npm run build

Manual verification still pending:

  1. Single-speaker clip with pauses.
  2. Two-speaker dialogue with interruptions.
  3. Music-heavy clip.
  4. Alignment-service unavailable fallback using a real upload.