video_translate/docs/plans/2026-03-17-precise-dialogue-localization.md
2026-03-18 11:42:00 +08:00


# Precise Dialogue Localization Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Build a high-precision subtitle pipeline that returns accurate sentence boundaries, word-level timings, and real speaker attribution while preserving the current editor flow.
**Architecture:** Keep the React app and `server.ts` as the public entry points, but move timing-critical work into a dedicated alignment adapter. The backend normalizes aligned words into sentence subtitles, translates text without changing timing, and returns quality metadata so the editor can enable or disable precision UI safely.
**Tech Stack:** React 19, TypeScript, Vite, Express, FFmpeg, OpenAI SDK, a new test runner (`vitest`), and a high-precision alignment backend adapter.
---
### Task 1: Add Test Infrastructure
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\package.json`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\vitest.config.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\test\setup.ts`
**Step 1: Write the failing test**
Create a minimal smoke test first so the test runner has a real target.
```ts
import { describe, expect, it } from 'vitest';

describe('test harness', () => {
  it('runs vitest in this workspace', () => {
    expect(true).toBe(true);
  });
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run`
Expected: FAIL because no `test` script or Vitest config exists yet.
**Step 3: Write minimal implementation**
1. Add `test` and `test:watch` scripts to `package.json`.
2. Add dev dependencies for `vitest`.
3. Create `vitest.config.ts` with a Node environment default.
4. Add `src/test/setup.ts` for shared setup.
```ts
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    environment: 'node',
    setupFiles: ['./src/test/setup.ts'],
  },
});
```
**Step 4: Run test to verify it passes**
Run: `npm test -- --run`
Expected: PASS with the smoke test.
**Step 5: Commit**
```bash
git add package.json vitest.config.ts src/test/setup.ts
git commit -m "test: add vitest infrastructure"
```
### Task 2: Extract Subtitle Pipeline Types and Normalizers
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\types.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\subtitlePipeline.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\subtitlePipeline.test.ts`
**Step 1: Write the failing test**
Write tests for normalization from aligned word payloads to UI-ready subtitles.
```ts
it('derives subtitle boundaries from first and last word', () => {
  const result = normalizeAlignedSentence({
    id: 's1',
    speakerId: 'spk_0',
    words: [
      { text: 'Hello', startTime: 1.2, endTime: 1.5, speakerId: 'spk_0', confidence: 0.99 },
      { text: 'world', startTime: 1.6, endTime: 2.0, speakerId: 'spk_0', confidence: 0.98 },
    ],
    originalText: 'Hello world',
    translatedText: '你好世界',
  });

  expect(result.startTime).toBe(1.2);
  expect(result.endTime).toBe(2.0);
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/lib/subtitlePipeline.test.ts`
Expected: FAIL because the new module and extended types do not exist.
**Step 3: Write minimal implementation**
1. Extend `Subtitle` in `src/types.ts` with `speakerId`, `words`, and `confidence`.
2. Create a pure helper module that normalizes backend payloads into frontend subtitles.
```ts
export const deriveSubtitleBounds = (words: WordTiming[]) => ({
  startTime: words[0]?.startTime ?? 0,
  endTime: words[words.length - 1]?.endTime ?? 0,
});
```
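Building on that bounds helper, a minimal sketch of `normalizeAlignedSentence` could look like the following. The `AlignedSentence` and `WordTiming` shapes are assumptions inferred from the test fixture above, and `deriveSubtitleBounds` is restated so the sketch is self-contained:

```typescript
interface WordTiming {
  text: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  confidence: number;
}

// Assumed input shape, inferred from the test fixture in Step 1.
interface AlignedSentence {
  id: string;
  speakerId: string;
  words: WordTiming[];
  originalText: string;
  translatedText: string;
}

export const deriveSubtitleBounds = (words: WordTiming[]) => ({
  startTime: words[0]?.startTime ?? 0,
  endTime: words[words.length - 1]?.endTime ?? 0,
});

// Normalize a backend-aligned sentence into a UI-ready subtitle whose
// start/end come from the first and last word timings.
export const normalizeAlignedSentence = (sentence: AlignedSentence) => ({
  id: sentence.id,
  speakerId: sentence.speakerId,
  originalText: sentence.originalText,
  translatedText: sentence.translatedText,
  words: sentence.words,
  ...deriveSubtitleBounds(sentence.words),
});
```

Keeping the helper pure (no I/O, no translation) is what lets the Step 1 test run without any server or network setup.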
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/lib/subtitlePipeline.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/types.ts src/lib/subtitlePipeline.ts src/lib/subtitlePipeline.test.ts
git commit -m "feat: add subtitle pipeline normalizers"
```
### Task 3: Implement Sentence Reconstruction Helpers
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\alignment\sentenceReconstruction.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\alignment\sentenceReconstruction.test.ts`
**Step 1: Write the failing test**
Cover pause splitting and speaker splitting.
```ts
it('splits sentences when speaker changes', () => {
  const result = rebuildSentences([
    { text: 'Hi', startTime: 0.0, endTime: 0.2, speakerId: 'spk_0', confidence: 0.9 },
    { text: 'there', startTime: 0.25, endTime: 0.5, speakerId: 'spk_0', confidence: 0.9 },
    { text: 'no', startTime: 0.55, endTime: 0.7, speakerId: 'spk_1', confidence: 0.9 },
  ]);

  expect(result).toHaveLength(2);
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/lib/alignment/sentenceReconstruction.test.ts`
Expected: FAIL because the helper module is missing.
**Step 3: Write minimal implementation**
Implement pure splitting rules:
1. Split on `speakerId` change.
2. Split when the gap between consecutive words exceeds `0.45` seconds.
3. Split when the accumulated sentence duration exceeds `8` seconds.
```ts
if (nextWord.speakerId !== currentSpeakerId) {
  flushSentence();
}
```
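Putting the three rules together, a self-contained sketch of `rebuildSentences` might look like this. The `0.45`-second gap and `8`-second duration thresholds come from the rules above; the `WordTiming` shape matches the test fixture:

```typescript
interface WordTiming {
  text: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  confidence: number;
}

const MAX_WORD_GAP = 0.45; // seconds of silence between words before splitting
const MAX_SENTENCE_DURATION = 8; // seconds before a sentence is force-split

// Group a flat word stream into sentences, splitting on speaker change,
// long pauses, and over-long sentences. Pure function: easy to unit test.
export const rebuildSentences = (words: WordTiming[]): WordTiming[][] => {
  const sentences: WordTiming[][] = [];
  let current: WordTiming[] = [];

  for (const word of words) {
    const last = current[current.length - 1];
    const shouldSplit =
      last !== undefined &&
      (word.speakerId !== last.speakerId ||
        word.startTime - last.endTime > MAX_WORD_GAP ||
        word.endTime - current[0].startTime > MAX_SENTENCE_DURATION);

    if (shouldSplit) {
      sentences.push(current);
      current = [];
    }
    current.push(word);
  }
  if (current.length > 0) sentences.push(current);
  return sentences;
};
```

This is a sketch, not the final rule set; real dialogue may also want punctuation-aware splitting, which can be layered on without changing the function's signature.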
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/lib/alignment/sentenceReconstruction.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/lib/alignment/sentenceReconstruction.ts src/lib/alignment/sentenceReconstruction.test.ts
git commit -m "feat: add sentence reconstruction rules"
```
### Task 4: Implement Speaker Assignment Helpers
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\alignment\speakerAssignment.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\alignment\speakerAssignment.test.ts`
**Step 1: Write the failing test**
Test overlap-based speaker assignment.
```ts
it('assigns each word to the speaker segment with maximum overlap', () => {
  const word = { text: 'hello', startTime: 1.0, endTime: 1.4 };
  const speakers = [
    { speakerId: 'spk_0', startTime: 0.8, endTime: 1.1 },
    { speakerId: 'spk_1', startTime: 1.1, endTime: 1.6 },
  ];

  expect(assignSpeakerToWord(word, speakers)).toBe('spk_1');
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/lib/alignment/speakerAssignment.test.ts`
Expected: FAIL because speaker assignment logic does not exist.
**Step 3: Write minimal implementation**
Add a pure overlap calculator and default to `unknown` when no segment overlaps.
```ts
const overlap = Math.max(
  0,
  Math.min(word.endTime, segment.endTime) - Math.max(word.startTime, segment.startTime),
);
```
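One way to finish the helper around that overlap calculation is sketched below. The input shapes are assumptions taken from the test fixture; words with no overlapping segment fall back to `'unknown'`, matching the default described above:

```typescript
interface TimedSpan {
  startTime: number;
  endTime: number;
}

interface SpeakerSegment extends TimedSpan {
  speakerId: string;
}

// Pick the speaker segment with the largest temporal overlap with the word;
// default to 'unknown' when nothing overlaps at all.
export const assignSpeakerToWord = (word: TimedSpan, segments: SpeakerSegment[]): string => {
  let bestId = 'unknown';
  let bestOverlap = 0;

  for (const segment of segments) {
    const overlap = Math.max(
      0,
      Math.min(word.endTime, segment.endTime) - Math.max(word.startTime, segment.startTime),
    );
    if (overlap > bestOverlap) {
      bestOverlap = overlap;
      bestId = segment.speakerId;
    }
  }
  return bestId;
};
```

Because diarization segments and alignment words come from different models, small boundary disagreements are expected; maximum overlap is a simple, deterministic tiebreaker.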
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/lib/alignment/speakerAssignment.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/lib/alignment/speakerAssignment.ts src/lib/alignment/speakerAssignment.test.ts
git commit -m "feat: add speaker assignment helpers"
```
### Task 5: Isolate Backend Pipeline Logic from `server.ts`
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitlePipeline.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitlePipeline.test.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\server.ts`
**Step 1: Write the failing test**
Add tests for orchestration-level fallback behavior.
```ts
it('returns partial quality when diarization is unavailable', async () => {
  const result = await buildSubtitlePayload({
    alignmentResult: {
      words: [{ text: 'hi', startTime: 0, endTime: 0.2, speakerId: 'unknown', confidence: 0.9 }],
      speakerSegments: [],
      quality: 'partial',
    },
  });

  expect(result.quality).toBe('partial');
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/server/subtitlePipeline.test.ts`
Expected: FAIL because orchestration code is still embedded in `server.ts`.
**Step 3: Write minimal implementation**
1. Move payload-building logic into `src/server/subtitlePipeline.ts`.
2. Make `server.ts` call the helper and only handle HTTP concerns.
```ts
export const buildSubtitlePayload = async (deps: SubtitlePipelineDeps) => {
  // normalize alignment result
  // translate text
  // return { subtitles, speakers, quality, ... }
};
```
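A fuller sketch of that orchestration is shown below. The `SubtitlePipelineDeps` shape is an assumption, and translation is modeled as an injected function so the helper stays testable without a live API; the real helper would first run the words through `rebuildSentences` rather than treating everything as one sentence:

```typescript
interface WordTiming {
  text: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  confidence: number;
}

interface AlignmentResult {
  words: WordTiming[];
  speakerSegments: { speakerId: string; startTime: number; endTime: number }[];
  quality: 'full' | 'partial' | 'fallback';
}

// Assumed dependency shape; translate is injected for testability.
interface SubtitlePipelineDeps {
  alignmentResult: AlignmentResult;
  translate?: (text: string) => Promise<string>;
}

export const buildSubtitlePayload = async ({ alignmentResult, translate }: SubtitlePipelineDeps) => {
  // Simplified: one sentence per payload. The real implementation would
  // call rebuildSentences() and normalize each sentence individually.
  const originalText = alignmentResult.words.map((w) => w.text).join(' ');
  const translatedText = translate ? await translate(originalText) : originalText;

  return {
    subtitles: [
      {
        id: 'sub_1',
        startTime: alignmentResult.words[0]?.startTime ?? 0,
        endTime: alignmentResult.words[alignmentResult.words.length - 1]?.endTime ?? 0,
        originalText,
        translatedText,
      },
    ],
    speakers: alignmentResult.speakerSegments,
    quality: alignmentResult.quality, // propagate so the UI can gate precision features
  };
};
```

Propagating `quality` unchanged is the key contract: the HTTP layer in `server.ts` never needs to know how the quality level was decided.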
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/server/subtitlePipeline.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/server/subtitlePipeline.ts src/server/subtitlePipeline.test.ts server.ts
git commit -m "refactor: isolate subtitle pipeline orchestration"
```
### Task 6: Add an Alignment Service Adapter
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\alignmentAdapter.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\server\alignmentAdapter.test.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\server.ts`
**Step 1: Write the failing test**
Test that the adapter maps raw alignment responses into normalized internal types.
```ts
it('maps aligned words and speaker segments from the adapter response', async () => {
  const result = await parseAlignmentResponse({
    words: [{ word: 'hello', start: 1.0, end: 1.2, speaker: 'spk_0', score: 0.95 }],
    speakers: [{ speaker: 'spk_0', start: 0.8, end: 1.6 }],
  });

  expect(result.words[0].speakerId).toBe('spk_0');
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/server/alignmentAdapter.test.ts`
Expected: FAIL because no adapter exists.
**Step 3: Write minimal implementation**
Create an adapter boundary with one public function such as `requestAlignedTranscript(audioPath)`.
```ts
export const requestAlignedTranscript = async (audioPath: string) => {
  // call local or remote alignment backend
  // normalize response shape
};
```
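Since the test exercises `parseAlignmentResponse` directly, a minimal mapping sketch could look like this. The raw field names (`word`, `start`, `end`, `speaker`, `score`) are taken from the test fixture and a real alignment backend may differ; the function is shown synchronous here, and `await`-ing it as in the test still works because `await` on a plain value resolves immediately:

```typescript
interface RawAlignedWord {
  word: string;
  start: number;
  end: number;
  speaker?: string;
  score?: number;
}

interface RawSpeakerSegment {
  speaker: string;
  start: number;
  end: number;
}

// Assumed raw response shape from the alignment backend.
interface RawAlignmentResponse {
  words: RawAlignedWord[];
  speakers: RawSpeakerSegment[];
}

// Translate the backend's field names into the normalized internal types
// used by the rest of the pipeline, with safe defaults for missing fields.
export const parseAlignmentResponse = (raw: RawAlignmentResponse) => ({
  words: raw.words.map((w) => ({
    text: w.word,
    startTime: w.start,
    endTime: w.end,
    speakerId: w.speaker ?? 'unknown',
    confidence: w.score ?? 0,
  })),
  speakerSegments: raw.speakers.map((s) => ({
    speakerId: s.speaker,
    startTime: s.start,
    endTime: s.end,
  })),
});
```

Keeping the field translation in one place means a backend swap only touches this adapter, never the sentence reconstruction or speaker assignment code.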
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/server/alignmentAdapter.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/server/alignmentAdapter.ts src/server/alignmentAdapter.test.ts server.ts
git commit -m "feat: add alignment service adapter"
```
### Task 7: Upgrade `/api/process-audio-pipeline` Response Shape
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\server.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\services\geminiService.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\services\geminiService.test.ts`
**Step 1: Write the failing test**
Add a client-side test for parsing `quality`, `speakers`, and `words`.
```ts
it('maps the enriched audio pipeline response into subtitle objects', async () => {
  const payload = {
    subtitles: [
      {
        id: 'sub_1',
        startTime: 1,
        endTime: 2,
        originalText: 'Hello',
        translatedText: '你好',
        speaker: 'Speaker 1',
        speakerId: 'spk_0',
        words: [{ text: 'Hello', startTime: 1, endTime: 2, speakerId: 'spk_0', confidence: 0.9 }],
        confidence: 0.9,
      },
    ],
    speakers: [{ speakerId: 'spk_0', label: 'Speaker 1' }],
    quality: 'full',
  };

  expect(mapPipelineResponse(payload).subtitles[0].words).toHaveLength(1);
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/services/geminiService.test.ts`
Expected: FAIL because the mapping helper does not exist.
**Step 3: Write minimal implementation**
1. Add a response-mapping helper in `src/services/geminiService.ts`.
2. Preserve the existing fallback path.
3. Carry `quality` metadata to the UI.
```ts
const quality = data.quality ?? 'fallback';
const subtitles = (data.subtitles ?? []).map(mapSubtitleFromApi);
```
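Those two lines could grow into a small self-contained mapper like the sketch below. The payload field names come from the test fixture; the API shapes (`ApiSubtitle`, `PipelinePayload`) are illustrative assumptions, and defaults are chosen so the legacy fallback path keeps working when the new fields are absent:

```typescript
interface ApiWord {
  text: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  confidence: number;
}

// Assumed wire shape; optional fields cover the legacy fallback payload.
interface ApiSubtitle {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  translatedText: string;
  speaker?: string;
  speakerId?: string;
  words?: ApiWord[];
  confidence?: number;
}

interface PipelinePayload {
  subtitles?: ApiSubtitle[];
  speakers?: { speakerId: string; label: string }[];
  quality?: 'full' | 'partial' | 'fallback';
}

const mapSubtitleFromApi = (sub: ApiSubtitle) => ({
  ...sub,
  speakerId: sub.speakerId ?? 'unknown',
  words: sub.words ?? [],
  confidence: sub.confidence ?? 0,
});

// Map the enriched response, defaulting to the legacy fallback shape when
// the new precision fields are missing.
export const mapPipelineResponse = (data: PipelinePayload) => ({
  subtitles: (data.subtitles ?? []).map(mapSubtitleFromApi),
  speakers: data.speakers ?? [],
  quality: data.quality ?? 'fallback',
});
```

Defaulting `quality` to `'fallback'` rather than throwing means an older server build degrades gracefully instead of breaking the editor.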
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/services/geminiService.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add server.ts src/services/geminiService.ts src/services/geminiService.test.ts
git commit -m "feat: return enriched subtitle pipeline payloads"
```
### Task 8: Add Precision Metadata to Editor State
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.tsx`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.test.tsx`
**Step 1: Write the failing test**
Add a test for rendering a fallback warning when `quality` is low.
```tsx
it('shows a low-precision notice for fallback subtitle results', () => {
  render(<EditorScreen ... />);

  expect(screen.getByText(/low-precision/i)).toBeInTheDocument();
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/components/EditorScreen.test.tsx`
Expected: FAIL because the component does not track pipeline quality yet.
**Step 3: Write minimal implementation**
1. Add state for `quality` and `speakers`.
2. Surface a small status badge or warning banner.
3. Keep the existing sentence list and timeline intact.
```tsx
{quality === 'fallback' && (
  <p className="text-xs text-amber-700">Low-precision timing detected. Manual review recommended.</p>
)}
```
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/components/EditorScreen.test.tsx`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/components/EditorScreen.tsx src/components/EditorScreen.test.tsx
git commit -m "feat: surface subtitle precision status in editor"
```
### Task 9: Add Word-Level Playback Helpers
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\playback\wordHighlight.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\playback\wordHighlight.test.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.tsx`
**Step 1: Write the failing test**
Test the active-word lookup helper.
```ts
it('returns the active word for the current playback time', () => {
  const activeWord = getActiveWord(
    [{ text: 'Hello', startTime: 1, endTime: 1.3, speakerId: 'spk_0', confidence: 0.9 }],
    1.1,
  );

  expect(activeWord?.text).toBe('Hello');
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/lib/playback/wordHighlight.test.ts`
Expected: FAIL because playback helpers do not exist.
**Step 3: Write minimal implementation**
1. Create a pure helper for active-word lookup.
2. Use it in `EditorScreen.tsx` to render highlighted word spans when `words` are present.
```ts
export const getActiveWord = (words: WordTiming[], currentTime: number) =>
  words.find((word) => currentTime >= word.startTime && currentTime <= word.endTime);
```
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/lib/playback/wordHighlight.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/lib/playback/wordHighlight.ts src/lib/playback/wordHighlight.test.ts src/components/EditorScreen.tsx
git commit -m "feat: add word-level playback highlighting"
```
### Task 10: Snap Timeline Edges to Word Boundaries
**Files:**
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\timeline\snapToWords.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\timeline\snapToWords.test.ts`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.tsx`
**Step 1: Write the failing test**
Test snapping to nearest word edges.
```ts
it('snaps a dragged start edge to the nearest word boundary', () => {
  const next = snapTimeToNearestWordBoundary(1.34, [
    { text: 'Hello', startTime: 1.0, endTime: 1.3, speakerId: 'spk_0', confidence: 0.9 },
    { text: 'world', startTime: 1.35, endTime: 1.8, speakerId: 'spk_0', confidence: 0.9 },
  ]);

  expect(next).toBe(1.35);
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/lib/timeline/snapToWords.test.ts`
Expected: FAIL because no snapping helper exists.
**Step 3: Write minimal implementation**
1. Add a pure snapping helper with a small tolerance window.
2. Use it in the left and right resize timeline handlers.
```ts
export const snapTimeToNearestWordBoundary = (time: number, words: WordTiming[]) => {
  // choose nearest start or end boundary within tolerance
};
```
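A possible implementation with an explicit tolerance window is sketched below. The `0.08`-second default is illustrative, not a value from the design; times outside the window are returned unchanged so free-form dragging still works:

```typescript
interface WordTiming {
  startTime: number;
  endTime: number;
}

const SNAP_TOLERANCE = 0.08; // seconds; illustrative default, tune against real drags

// Snap a dragged time to the closest word start or end within tolerance;
// return the original time unchanged when nothing is close enough.
export const snapTimeToNearestWordBoundary = (
  time: number,
  words: WordTiming[],
  tolerance: number = SNAP_TOLERANCE,
): number => {
  let best = time;
  let bestDistance = tolerance;

  for (const word of words) {
    for (const boundary of [word.startTime, word.endTime]) {
      const distance = Math.abs(boundary - time);
      if (distance <= bestDistance) {
        bestDistance = distance;
        best = boundary;
      }
    }
  }
  return best;
};
```

Exposing `tolerance` as an optional parameter keeps the helper pure and lets the timeline pass a zoom-dependent window later without an API change.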
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/lib/timeline/snapToWords.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/lib/timeline/snapToWords.ts src/lib/timeline/snapToWords.test.ts src/components/EditorScreen.tsx
git commit -m "feat: snap subtitle edits to word boundaries"
```
### Task 11: Add Speaker-Aware UI State
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.tsx`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\src\voices.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\speakers\speakerPresentation.ts`
- Create: `E:\Downloads\ai-video-dubbing-&-translation\src\lib\speakers\speakerPresentation.test.ts`
**Step 1: Write the failing test**
Test stable color and label generation for speaker tracks.
```ts
it('creates stable display metadata for each speaker id', () => {
  const speaker = buildSpeakerPresentation({ speakerId: 'spk_0', label: 'Speaker 1' });

  expect(speaker.color).toMatch(/^#/);
});
```
**Step 2: Run test to verify it fails**
Run: `npm test -- --run src/lib/speakers/speakerPresentation.test.ts`
Expected: FAIL because no speaker presentation helper exists.
**Step 3: Write minimal implementation**
1. Create a helper that derives display color and fallback label from `speakerId`.
2. Use it to color sentence chips or timeline items.
3. Keep voice assignment behavior backward compatible.
```ts
export const buildSpeakerPresentation = ({ speakerId, label }: SpeakerTrack) => ({
  speakerId,
  label,
  color: '#1677ff',
});
```
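The fixed color above is enough to pass the test; if a genuinely distinct, stable color per speaker is wanted later, one hedged sketch is to hash the `speakerId` into a fixed palette. The palette values and hash below are illustrative assumptions, not part of the design:

```typescript
interface SpeakerTrack {
  speakerId: string;
  label?: string;
}

// Illustrative palette; stable because the index depends only on speakerId.
const SPEAKER_COLORS = ['#1677ff', '#fa541c', '#52c41a', '#722ed1', '#eb2f96'];

// Simple deterministic string hash (unsigned 32-bit).
const hashString = (value: string): number => {
  let hash = 0;
  for (let i = 0; i < value.length; i += 1) {
    hash = (hash * 31 + value.charCodeAt(i)) >>> 0;
  }
  return hash;
};

export const buildSpeakerPresentation = ({ speakerId, label }: SpeakerTrack) => ({
  speakerId,
  label: label ?? speakerId, // fall back to the raw id when no label was assigned
  color: SPEAKER_COLORS[hashString(speakerId) % SPEAKER_COLORS.length],
});
```

Because the color depends only on `speakerId`, re-running the pipeline never reshuffles speaker colors in the editor.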
**Step 4: Run test to verify it passes**
Run: `npm test -- --run src/lib/speakers/speakerPresentation.test.ts`
Expected: PASS.
**Step 5: Commit**
```bash
git add src/components/EditorScreen.tsx src/voices.ts src/lib/speakers/speakerPresentation.ts src/lib/speakers/speakerPresentation.test.ts
git commit -m "feat: add speaker-aware editor presentation"
```
### Task 12: Verify End-to-End Behavior and Update Docs
**Files:**
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\README.md`
- Modify: `E:\Downloads\ai-video-dubbing-&-translation\docs\plans\2026-03-17-precise-dialogue-localization-design.md`
**Step 1: Write the failing test**
Write down the manual verification checklist before changing docs so the release criteria are explicit.
```md
- [ ] Single-speaker clip returns `quality: full`
- [ ] Two-speaker clip shows distinct speaker IDs
- [ ] Fallback path shows low-precision notice
- [ ] Timeline resize snaps to word boundaries
```
**Step 2: Run test to verify it fails**
Run: `npm run lint`
Expected: lint may pass or fail depending on in-progress code; either way, the task remains incomplete until every checklist item above has been executed manually and the results recorded.
**Step 3: Write minimal implementation**
1. Update `README.md` with new environment requirements and pipeline description.
2. Record the manual verification results in the design document or a linked note.
```md
## High-Precision Subtitle Mode
Set the alignment backend environment variables before running the app.
```
**Step 4: Run test to verify it passes**
Run: `npm test -- --run`
Expected: PASS.
Run: `npm run lint`
Expected: PASS.
Run: `npm run build`
Expected: PASS.
**Step 5: Commit**
```bash
git add README.md docs/plans/2026-03-17-precise-dialogue-localization-design.md
git commit -m "docs: document precise dialogue localization workflow"
```
## Notes for Execution
1. This workspace currently has no `.git` directory, so commit steps cannot be executed until the project is placed in a real Git checkout.
2. Introduce the alignment backend behind environment-based configuration so existing demos can still use the current fallback path.
3. Prefer pure functions for sentence reconstruction, speaker assignment, snapping, and word-highlighting logic so they remain easy to test.