video_translate/docs/plans/2026-03-17-precise-dialogue-localization.md
2026-03-18 11:42:00 +08:00


Precise Dialogue Localization Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Build a high-precision subtitle pipeline that returns accurate sentence boundaries, word-level timings, and real speaker attribution while preserving the current editor flow.

Architecture: Keep the React app and server.ts as the public entry points, but move timing-critical work into a dedicated alignment adapter. The backend normalizes aligned words into sentence subtitles, translates text without changing timing, and returns quality metadata so the editor can enable or disable precision UI safely.

Tech Stack: React 19, TypeScript, Vite, Express, FFmpeg, OpenAI SDK, a new test runner (Vitest), and a high-precision alignment backend adapter.


Task 1: Add Test Infrastructure

Files:

  • Modify: E:\Downloads\ai-video-dubbing-&-translation\package.json
  • Create: E:\Downloads\ai-video-dubbing-&-translation\vitest.config.ts
  • Create: E:\Downloads\ai-video-dubbing-&-translation\src\test\setup.ts

Step 1: Write the failing test

Create a minimal smoke test first so the test runner has a real target.

import { describe, expect, it } from 'vitest';

describe('test harness', () => {
  it('runs vitest in this workspace', () => {
    expect(true).toBe(true);
  });
});

Step 2: Run test to verify it fails

Run: `npm test -- --run`
Expected: FAIL because no test script or Vitest config exists yet.

Step 3: Write minimal implementation

  1. Add test and test:watch scripts to package.json.
  2. Add dev dependencies for vitest.
  3. Create vitest.config.ts with a Node environment default.
  4. Add src/test/setup.ts for shared setup.

```ts
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    environment: 'node',
    setupFiles: ['./src/test/setup.ts'],
  },
});
```

Step 4: Run test to verify it passes

Run: `npm test -- --run`
Expected: PASS with the smoke test.

Step 5: Commit

git add package.json vitest.config.ts src/test/setup.ts
git commit -m "test: add vitest infrastructure"

Task 2: Extract Subtitle Pipeline Types and Normalizers

Files:

  • Modify: E:\Downloads\ai-video-dubbing-&-translation\src\types.ts
  • Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\subtitlePipeline.ts
  • Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\subtitlePipeline.test.ts

Step 1: Write the failing test

Write tests for normalization from aligned word payloads to UI-ready subtitles.

it('derives subtitle boundaries from first and last word', () => {
  const result = normalizeAlignedSentence({
    id: 's1',
    speakerId: 'spk_0',
    words: [
      { text: 'Hello', startTime: 1.2, endTime: 1.5, speakerId: 'spk_0', confidence: 0.99 },
      { text: 'world', startTime: 1.6, endTime: 2.0, speakerId: 'spk_0', confidence: 0.98 },
    ],
    originalText: 'Hello world',
    translatedText: '你好世界',
  });

  expect(result.startTime).toBe(1.2);
  expect(result.endTime).toBe(2.0);
});

Step 2: Run test to verify it fails

Run: `npm test -- --run src/lib/subtitlePipeline.test.ts`
Expected: FAIL because the new module and extended types do not exist.

Step 3: Write minimal implementation

  1. Extend Subtitle in src/types.ts with speakerId, words, and confidence.
  2. Create a pure helper module that normalizes backend payloads into frontend subtitles.

```ts
export const deriveSubtitleBounds = (words: WordTiming[]) => ({
  startTime: words[0]?.startTime ?? 0,
  endTime: words[words.length - 1]?.endTime ?? 0,
});
```
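
The bounds helper composes into the sentence normalizer that the test above exercises. A minimal self-contained sketch, assuming the `AlignedSentence` payload shape shown in the test and averaging word confidences (the real aggregation rule may differ):

```typescript
interface WordTiming {
  text: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  confidence: number;
}

interface AlignedSentence {
  id: string;
  speakerId: string;
  words: WordTiming[];
  originalText: string;
  translatedText: string;
}

const deriveSubtitleBounds = (words: WordTiming[]) => ({
  startTime: words[0]?.startTime ?? 0,
  endTime: words[words.length - 1]?.endTime ?? 0,
});

// Normalize one aligned sentence into a UI-ready subtitle: boundaries come
// from the first and last word; confidence is the mean word confidence.
export const normalizeAlignedSentence = (sentence: AlignedSentence) => ({
  ...sentence,
  ...deriveSubtitleBounds(sentence.words),
  confidence:
    sentence.words.length === 0
      ? 0
      : sentence.words.reduce((sum, w) => sum + w.confidence, 0) / sentence.words.length,
});
```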

Step 4: Run test to verify it passes

Run: `npm test -- --run src/lib/subtitlePipeline.test.ts`
Expected: PASS.

Step 5: Commit

git add src/types.ts src/lib/subtitlePipeline.ts src/lib/subtitlePipeline.test.ts
git commit -m "feat: add subtitle pipeline normalizers"

Task 3: Implement Sentence Reconstruction Helpers

Files:

  • Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\alignment\sentenceReconstruction.ts
  • Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\alignment\sentenceReconstruction.test.ts

Step 1: Write the failing test

Cover pause splitting and speaker splitting.

it('splits sentences when speaker changes', () => {
  const result = rebuildSentences([
    { text: 'Hi', startTime: 0.0, endTime: 0.2, speakerId: 'spk_0', confidence: 0.9 },
    { text: 'there', startTime: 0.25, endTime: 0.5, speakerId: 'spk_0', confidence: 0.9 },
    { text: 'no', startTime: 0.55, endTime: 0.7, speakerId: 'spk_1', confidence: 0.9 },
  ]);

  expect(result).toHaveLength(2);
});

Step 2: Run test to verify it fails

Run: `npm test -- --run src/lib/alignment/sentenceReconstruction.test.ts`
Expected: FAIL because the helper module is missing.

Step 3: Write minimal implementation

Implement pure splitting rules:

  1. Split on speakerId change.
  2. Split when the gap between consecutive words exceeds 0.45 seconds.
  3. Split when the sentence duration exceeds 8 seconds.

```ts
if (nextWord.speakerId !== currentSpeakerId) {
  flushSentence();
}
```
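
Taken together, the three rules can be sketched as a single pass over the word stream. The 0.45 s gap and 8 s duration thresholds are the values from the steps above; treat them as tunable constants:

```typescript
interface WordTiming {
  text: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  confidence: number;
}

interface SentenceDraft {
  speakerId: string;
  words: WordTiming[];
}

const MAX_WORD_GAP_SECONDS = 0.45;      // silence that forces a split
const MAX_SENTENCE_DURATION_SECONDS = 8; // cap before a sentence is flushed

export const rebuildSentences = (words: WordTiming[]): SentenceDraft[] => {
  const sentences: SentenceDraft[] = [];
  let current: WordTiming[] = [];

  const flush = () => {
    if (current.length > 0) {
      sentences.push({ speakerId: current[0].speakerId, words: current });
      current = [];
    }
  };

  for (const word of words) {
    const last = current.length > 0 ? current[current.length - 1] : undefined;
    const speakerChanged = last !== undefined && last.speakerId !== word.speakerId;
    const gapTooLong =
      last !== undefined && word.startTime - last.endTime > MAX_WORD_GAP_SECONDS;
    const tooLong =
      current.length > 0 &&
      word.endTime - current[0].startTime > MAX_SENTENCE_DURATION_SECONDS;
    if (speakerChanged || gapTooLong || tooLong) flush();
    current.push(word);
  }
  flush();
  return sentences;
};
```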

Step 4: Run test to verify it passes

Run: `npm test -- --run src/lib/alignment/sentenceReconstruction.test.ts`
Expected: PASS.

Step 5: Commit

git add src/lib/alignment/sentenceReconstruction.ts src/lib/alignment/sentenceReconstruction.test.ts
git commit -m "feat: add sentence reconstruction rules"

Task 4: Implement Speaker Assignment Helpers

Files:

  • Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\alignment\speakerAssignment.ts
  • Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\alignment\speakerAssignment.test.ts

Step 1: Write the failing test

Test overlap-based speaker assignment.

it('assigns each word to the speaker segment with maximum overlap', () => {
  const word = { text: 'hello', startTime: 1.0, endTime: 1.4 };
  const speakers = [
    { speakerId: 'spk_0', startTime: 0.8, endTime: 1.1 },
    { speakerId: 'spk_1', startTime: 1.1, endTime: 1.6 },
  ];

  expect(assignSpeakerToWord(word, speakers)).toBe('spk_1');
});

Step 2: Run test to verify it fails

Run: `npm test -- --run src/lib/alignment/speakerAssignment.test.ts`
Expected: FAIL because speaker assignment logic does not exist.

Step 3: Write minimal implementation

Add a pure overlap calculator and default to unknown when no segment overlaps.

```ts
const overlap = Math.max(
  0,
  Math.min(word.endTime, segment.endTime) - Math.max(word.startTime, segment.startTime),
);
```
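
Wrapping that overlap calculation in the full assignment helper could look like the following sketch, with `'unknown'` as the documented default when no segment overlaps the word:

```typescript
interface Interval {
  startTime: number;
  endTime: number;
}

interface SpeakerSegment extends Interval {
  speakerId: string;
}

// Overlap in seconds between two intervals; zero when they are disjoint.
const overlapSeconds = (a: Interval, b: Interval): number =>
  Math.max(0, Math.min(a.endTime, b.endTime) - Math.max(a.startTime, b.startTime));

export const assignSpeakerToWord = (
  word: Interval,
  segments: SpeakerSegment[],
): string => {
  let bestId = 'unknown'; // default when nothing overlaps
  let bestOverlap = 0;
  for (const segment of segments) {
    const overlap = overlapSeconds(word, segment);
    if (overlap > bestOverlap) {
      bestOverlap = overlap;
      bestId = segment.speakerId;
    }
  }
  return bestId;
};
```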

Step 4: Run test to verify it passes

Run: `npm test -- --run src/lib/alignment/speakerAssignment.test.ts`
Expected: PASS.

Step 5: Commit

git add src/lib/alignment/speakerAssignment.ts src/lib/alignment/speakerAssignment.test.ts
git commit -m "feat: add speaker assignment helpers"

Task 5: Isolate Backend Pipeline Logic from server.ts

Files:

  • Create: E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitlePipeline.ts
  • Create: E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitlePipeline.test.ts
  • Modify: E:\Downloads\ai-video-dubbing-&-translation\server.ts

Step 1: Write the failing test

Add tests for orchestration-level fallback behavior.

it('returns partial quality when diarization is unavailable', async () => {
  const result = await buildSubtitlePayload({
    alignmentResult: {
      words: [{ text: 'hi', startTime: 0, endTime: 0.2, speakerId: 'unknown', confidence: 0.9 }],
      speakerSegments: [],
      quality: 'partial',
    },
  });

  expect(result.quality).toBe('partial');
});

Step 2: Run test to verify it fails

Run: `npm test -- --run src/server/subtitlePipeline.test.ts`
Expected: FAIL because orchestration code is still embedded in server.ts.

Step 3: Write minimal implementation

  1. Move payload-building logic into src/server/subtitlePipeline.ts.
  2. Make server.ts call the helper and only handle HTTP concerns.

```ts
export const buildSubtitlePayload = async (deps: SubtitlePipelineDeps) => {
  // normalize alignment result
  // translate text
  // return { subtitles, speakers, quality, ... }
};
```
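
One way the orchestration could propagate quality, sketched under the assumption that `SubtitlePipelineDeps` carries the alignment result plus an injectable translator; the names and shapes here are placeholders for whatever the real module defines:

```typescript
type Quality = 'full' | 'partial' | 'fallback';

interface WordTiming {
  text: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  confidence: number;
}

interface SpeakerSegment {
  speakerId: string;
  startTime: number;
  endTime: number;
}

interface SubtitlePipelineDeps {
  alignmentResult: {
    words: WordTiming[];
    speakerSegments: SpeakerSegment[];
    quality: Quality;
  };
  // Injectable so tests can stub translation; identity when absent.
  translate?: (text: string) => Promise<string>;
}

export const buildSubtitlePayload = async ({ alignmentResult, translate }: SubtitlePipelineDeps) => {
  const originalText = alignmentResult.words.map((w) => w.text).join(' ');
  const translatedText = translate ? await translate(originalText) : originalText;
  return {
    subtitles: [{ originalText, translatedText, words: alignmentResult.words }],
    speakers: alignmentResult.speakerSegments,
    // Quality passes through untouched so the editor can degrade its UI.
    quality: alignmentResult.quality,
  };
};
```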

Step 4: Run test to verify it passes

Run: `npm test -- --run src/server/subtitlePipeline.test.ts`
Expected: PASS.

Step 5: Commit

git add src/server/subtitlePipeline.ts src/server/subtitlePipeline.test.ts server.ts
git commit -m "refactor: isolate subtitle pipeline orchestration"

Task 6: Add an Alignment Service Adapter

Files:

  • Create: E:\Downloads\ai-video-dubbing-&-translation\src\server\alignmentAdapter.ts
  • Create: E:\Downloads\ai-video-dubbing-&-translation\src\server\alignmentAdapter.test.ts
  • Modify: E:\Downloads\ai-video-dubbing-&-translation\server.ts

Step 1: Write the failing test

Test that the adapter maps raw alignment responses into normalized internal types.

it('maps aligned words and speaker segments from the adapter response', async () => {
  const result = await parseAlignmentResponse({
    words: [{ word: 'hello', start: 1.0, end: 1.2, speaker: 'spk_0', score: 0.95 }],
    speakers: [{ speaker: 'spk_0', start: 0.8, end: 1.6 }],
  });

  expect(result.words[0].speakerId).toBe('spk_0');
});

Step 2: Run test to verify it fails

Run: `npm test -- --run src/server/alignmentAdapter.test.ts`
Expected: FAIL because no adapter exists.

Step 3: Write minimal implementation

Create an adapter boundary with one public function such as requestAlignedTranscript(audioPath).

```ts
export const requestAlignedTranscript = async (audioPath: string) => {
  // call local or remote alignment backend
  // normalize response shape
};
```
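
The response-parsing half of the adapter, which the test targets, could be sketched like this. The raw field names (`word`, `start`, `end`, `speaker`, `score`) are assumptions taken from the test fixture; the real alignment backend may use different keys:

```typescript
// Hypothetical raw shapes mirroring the test fixture.
interface RawWord {
  word: string;
  start: number;
  end: number;
  speaker?: string;
  score?: number;
}

interface RawSpeaker {
  speaker: string;
  start: number;
  end: number;
}

interface RawAlignmentResponse {
  words?: RawWord[];
  speakers?: RawSpeaker[];
}

// Map the backend's field names onto the normalized internal types,
// defaulting missing speakers to 'unknown' and missing scores to 0.
export const parseAlignmentResponse = (raw: RawAlignmentResponse) => ({
  words: (raw.words ?? []).map((w) => ({
    text: w.word,
    startTime: w.start,
    endTime: w.end,
    speakerId: w.speaker ?? 'unknown',
    confidence: w.score ?? 0,
  })),
  speakerSegments: (raw.speakers ?? []).map((s) => ({
    speakerId: s.speaker,
    startTime: s.start,
    endTime: s.end,
  })),
});
```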

Step 4: Run test to verify it passes

Run: `npm test -- --run src/server/alignmentAdapter.test.ts`
Expected: PASS.

Step 5: Commit

git add src/server/alignmentAdapter.ts src/server/alignmentAdapter.test.ts server.ts
git commit -m "feat: add alignment service adapter"

Task 7: Upgrade /api/process-audio-pipeline Response Shape

Files:

  • Modify: E:\Downloads\ai-video-dubbing-&-translation\server.ts
  • Modify: E:\Downloads\ai-video-dubbing-&-translation\src\services\geminiService.ts
  • Create: E:\Downloads\ai-video-dubbing-&-translation\src\services\geminiService.test.ts

Step 1: Write the failing test

Add a client-side test for parsing quality, speakers, and words.

it('maps the enriched audio pipeline response into subtitle objects', async () => {
  const payload = {
    subtitles: [
      {
        id: 'sub_1',
        startTime: 1,
        endTime: 2,
        originalText: 'Hello',
        translatedText: '你好',
        speaker: 'Speaker 1',
        speakerId: 'spk_0',
        words: [{ text: 'Hello', startTime: 1, endTime: 2, speakerId: 'spk_0', confidence: 0.9 }],
        confidence: 0.9,
      },
    ],
    speakers: [{ speakerId: 'spk_0', label: 'Speaker 1' }],
    quality: 'full',
  };

  expect(mapPipelineResponse(payload).subtitles[0].words).toHaveLength(1);
});

Step 2: Run test to verify it fails

Run: `npm test -- --run src/services/geminiService.test.ts`
Expected: FAIL because the mapping helper does not exist.

Step 3: Write minimal implementation

  1. Add a response-mapping helper in src/services/geminiService.ts.
  2. Preserve the existing fallback path.
  3. Carry quality metadata to the UI.

```ts
const quality = data.quality ?? 'fallback';
const subtitles = (data.subtitles ?? []).map(mapSubtitleFromApi);
```

Step 4: Run test to verify it passes

Run: `npm test -- --run src/services/geminiService.test.ts`
Expected: PASS.

Step 5: Commit

git add server.ts src/services/geminiService.ts src/services/geminiService.test.ts
git commit -m "feat: return enriched subtitle pipeline payloads"

Task 8: Add Precision Metadata to Editor State

Files:

  • Modify: E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.tsx
  • Create: E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.test.tsx

Step 1: Write the failing test

Add a test for rendering a fallback warning when quality is low.

it('shows a low-precision notice for fallback subtitle results', () => {
  render(<EditorScreen ... />);
  expect(screen.getByText(/low-precision/i)).toBeInTheDocument();
});

Step 2: Run test to verify it fails

Run: `npm test -- --run src/components/EditorScreen.test.tsx`
Expected: FAIL because the component does not track pipeline quality yet.

Step 3: Write minimal implementation

  1. Add state for quality and speakers.
  2. Surface a small status badge or warning banner.
  3. Keep the existing sentence list and timeline intact.

```tsx
{quality === 'fallback' && (
  <p className="text-xs text-amber-700">Low-precision timing detected. Manual review recommended.</p>
)}
```

Step 4: Run test to verify it passes

Run: `npm test -- --run src/components/EditorScreen.test.tsx`
Expected: PASS.

Step 5: Commit

git add src/components/EditorScreen.tsx src/components/EditorScreen.test.tsx
git commit -m "feat: surface subtitle precision status in editor"

Task 9: Add Word-Level Playback Helpers

Files:

  • Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\playback\wordHighlight.ts
  • Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\playback\wordHighlight.test.ts
  • Modify: E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.tsx

Step 1: Write the failing test

Test the active-word lookup helper.

it('returns the active word for the current playback time', () => {
  const activeWord = getActiveWord([
    { text: 'Hello', startTime: 1, endTime: 1.3, speakerId: 'spk_0', confidence: 0.9 },
  ], 1.1);

  expect(activeWord?.text).toBe('Hello');
});

Step 2: Run test to verify it fails

Run: `npm test -- --run src/lib/playback/wordHighlight.test.ts`
Expected: FAIL because playback helpers do not exist.

Step 3: Write minimal implementation

  1. Create a pure helper for active-word lookup.
  2. Use it in EditorScreen.tsx to render highlighted word spans when words are present.

```ts
export const getActiveWord = (words: WordTiming[], currentTime: number) =>
  words.find((word) => currentTime >= word.startTime && currentTime <= word.endTime);
```

Step 4: Run test to verify it passes

Run: `npm test -- --run src/lib/playback/wordHighlight.test.ts`
Expected: PASS.

Step 5: Commit

git add src/lib/playback/wordHighlight.ts src/lib/playback/wordHighlight.test.ts src/components/EditorScreen.tsx
git commit -m "feat: add word-level playback highlighting"

Task 10: Snap Timeline Edges to Word Boundaries

Files:

  • Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\timeline\snapToWords.ts
  • Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\timeline\snapToWords.test.ts
  • Modify: E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.tsx

Step 1: Write the failing test

Test snapping to nearest word edges.

it('snaps a dragged start edge to the nearest word boundary', () => {
  const next = snapTimeToNearestWordBoundary(
    1.34,
    [
      { text: 'Hello', startTime: 1.0, endTime: 1.3, speakerId: 'spk_0', confidence: 0.9 },
      { text: 'world', startTime: 1.35, endTime: 1.8, speakerId: 'spk_0', confidence: 0.9 },
    ],
  );

  expect(next).toBe(1.35);
});

Step 2: Run test to verify it fails

Run: `npm test -- --run src/lib/timeline/snapToWords.test.ts`
Expected: FAIL because no snapping helper exists.

Step 3: Write minimal implementation

  1. Add a pure snapping helper with a small tolerance window.
  2. Use it in the left and right resize timeline handlers.

```ts
export const snapTimeToNearestWordBoundary = (time: number, words: WordTiming[]) => {
  // choose nearest start or end boundary within tolerance
};
```
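
A filled-in sketch of the helper, assuming a hypothetical 150 ms tolerance window (the plan leaves the exact tolerance open); the raw time is returned unchanged when no boundary is close enough:

```typescript
interface WordTiming {
  text: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  confidence: number;
}

// Hypothetical tolerance: only snap when a boundary is within 150 ms.
const SNAP_TOLERANCE_SECONDS = 0.15;

export const snapTimeToNearestWordBoundary = (
  time: number,
  words: WordTiming[],
): number => {
  let snapped = time; // fall back to the raw time when nothing is close
  let bestDistance = SNAP_TOLERANCE_SECONDS;
  for (const word of words) {
    for (const boundary of [word.startTime, word.endTime]) {
      const distance = Math.abs(boundary - time);
      if (distance < bestDistance) {
        bestDistance = distance;
        snapped = boundary;
      }
    }
  }
  return snapped;
};
```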

Step 4: Run test to verify it passes

Run: `npm test -- --run src/lib/timeline/snapToWords.test.ts`
Expected: PASS.

Step 5: Commit

git add src/lib/timeline/snapToWords.ts src/lib/timeline/snapToWords.test.ts src/components/EditorScreen.tsx
git commit -m "feat: snap subtitle edits to word boundaries"

Task 11: Add Speaker-Aware UI State

Files:

  • Modify: E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.tsx
  • Modify: E:\Downloads\ai-video-dubbing-&-translation\src\voices.ts
  • Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\speakers\speakerPresentation.ts
  • Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\speakers\speakerPresentation.test.ts

Step 1: Write the failing test

Test stable color and label generation for speaker tracks.

it('creates stable display metadata for each speaker id', () => {
  const speaker = buildSpeakerPresentation({ speakerId: 'spk_0', label: 'Speaker 1' });
  expect(speaker.color).toMatch(/^#/);
});

Step 2: Run test to verify it fails

Run: `npm test -- --run src/lib/speakers/speakerPresentation.test.ts`
Expected: FAIL because no speaker presentation helper exists.

Step 3: Write minimal implementation

  1. Create a helper that derives display color and fallback label from speakerId.
  2. Use it to color sentence chips or timeline items.
  3. Keep voice assignment behavior backward compatible.

```ts
export const buildSpeakerPresentation = ({ speakerId, label }: SpeakerTrack) => ({
  speakerId,
  label,
  color: '#1677ff',
});
```
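
The hard-coded color satisfies the regex test, but "stable display metadata" implies a deterministic mapping from speakerId to color. One sketch hashes the id into a fixed palette; the palette values and the fallback-label format are arbitrary choices, not taken from the codebase:

```typescript
interface SpeakerTrack {
  speakerId: string;
  label?: string;
}

// Arbitrary palette; any set of distinct hex colors works, since the only
// requirement is that the same speakerId always maps to the same entry.
const PALETTE = ['#1677ff', '#fa541c', '#52c41a', '#722ed1', '#eb2f96', '#13c2c2'];

// Simple deterministic string hash (unsigned 32-bit polynomial rolling hash).
const hashString = (value: string): number => {
  let hash = 0;
  for (let i = 0; i < value.length; i++) {
    hash = (hash * 31 + value.charCodeAt(i)) >>> 0;
  }
  return hash;
};

export const buildSpeakerPresentation = ({ speakerId, label }: SpeakerTrack) => ({
  speakerId,
  label: label ?? `Speaker ${speakerId}`, // hypothetical fallback label
  color: PALETTE[hashString(speakerId) % PALETTE.length],
});
```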

Step 4: Run test to verify it passes

Run: `npm test -- --run src/lib/speakers/speakerPresentation.test.ts`
Expected: PASS.

Step 5: Commit

git add src/components/EditorScreen.tsx src/voices.ts src/lib/speakers/speakerPresentation.ts src/lib/speakers/speakerPresentation.test.ts
git commit -m "feat: add speaker-aware editor presentation"

Task 12: Verify End-to-End Behavior and Update Docs

Files:

  • Modify: E:\Downloads\ai-video-dubbing-&-translation\README.md
  • Modify: E:\Downloads\ai-video-dubbing-&-translation\docs\plans\2026-03-17-precise-dialogue-localization-design.md

Step 1: Write the failing test

Write down the manual verification checklist before changing docs so the release criteria are explicit.

- [ ] Single-speaker clip returns `quality: full`
- [ ] Two-speaker clip shows distinct speaker IDs
- [ ] Fallback path shows low-precision notice
- [ ] Timeline resize snaps to word boundaries

Step 2: Run test to verify it fails

Run: `npm run lint`
Expected: PASS or FAIL depending on in-progress code; either way, verification remains incomplete until the checklist above has been executed.

Step 3: Write minimal implementation

  1. Update README.md with new environment requirements and pipeline description.
  2. Record the manual verification results in the design document or a linked note.

```md
## High-Precision Subtitle Mode

Set the alignment backend environment variables before running the app.
```

Step 4: Run test to verify it passes

Run: `npm test -- --run`
Expected: PASS.

Run: `npm run lint`
Expected: PASS.

Run: `npm run build`
Expected: PASS.

Step 5: Commit

git add README.md docs/plans/2026-03-17-precise-dialogue-localization-design.md
git commit -m "docs: document precise dialogue localization workflow"

Notes for Execution

  1. This workspace currently has no .git directory, so commit steps cannot be executed until the project is placed in a real Git checkout.
  2. Introduce the alignment backend behind environment-based configuration so existing demos can still use the current fallback path.
  3. Prefer pure functions for sentence reconstruction, speaker assignment, snapping, and word-highlighting logic so they remain easy to test.