Precise Dialogue Localization Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Build a high-precision subtitle pipeline that returns accurate sentence boundaries, word-level timings, and real speaker attribution while preserving the current editor flow.
Architecture: Keep the React app and server.ts as the public entry points, but move timing-critical work into a dedicated alignment adapter. The backend normalizes aligned words into sentence subtitles, translates text without changing timing, and returns quality metadata so the editor can enable or disable precision UI safely.
Tech Stack: React 19, TypeScript, Vite, Express, FFmpeg, OpenAI SDK, a new test runner (vitest), and a high-precision alignment backend adapter.
Task 1: Add Test Infrastructure
Files:
- Modify: E:\Downloads\ai-video-dubbing-&-translation\package.json
- Create: E:\Downloads\ai-video-dubbing-&-translation\vitest.config.ts
- Create: E:\Downloads\ai-video-dubbing-&-translation\src\test\setup.ts
Step 1: Write the failing test
Create a minimal smoke test first so the test runner has a real target.
import { describe, expect, it } from 'vitest';
describe('test harness', () => {
it('runs vitest in this workspace', () => {
expect(true).toBe(true);
});
});
Step 2: Run test to verify it fails
Run: npm test -- --run
Expected: FAIL because no test script or Vitest config exists yet.
Step 3: Write minimal implementation
- Add `test` and `test:watch` scripts to `package.json`.
- Add `vitest` as a dev dependency.
- Create `vitest.config.ts` with a Node environment default.
- Create `src/test/setup.ts` for shared setup.
import { defineConfig } from 'vitest/config';
export default defineConfig({
test: {
environment: 'node',
setupFiles: ['./src/test/setup.ts'],
},
});
Step 4: Run test to verify it passes
Run: npm test -- --run
Expected: PASS with the smoke test.
Step 5: Commit
git add package.json vitest.config.ts src/test/setup.ts
git commit -m "test: add vitest infrastructure"
Task 2: Extract Subtitle Pipeline Types and Normalizers
Files:
- Modify: E:\Downloads\ai-video-dubbing-&-translation\src\types.ts
- Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\subtitlePipeline.ts
- Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\subtitlePipeline.test.ts
Step 1: Write the failing test
Write tests for normalization from aligned word payloads to UI-ready subtitles.
it('derives subtitle boundaries from first and last word', () => {
const result = normalizeAlignedSentence({
id: 's1',
speakerId: 'spk_0',
words: [
{ text: 'Hello', startTime: 1.2, endTime: 1.5, speakerId: 'spk_0', confidence: 0.99 },
{ text: 'world', startTime: 1.6, endTime: 2.0, speakerId: 'spk_0', confidence: 0.98 },
],
originalText: 'Hello world',
translatedText: '你好世界',
});
expect(result.startTime).toBe(1.2);
expect(result.endTime).toBe(2.0);
});
Step 2: Run test to verify it fails
Run: npm test -- --run src/lib/subtitlePipeline.test.ts
Expected: FAIL because the new module and extended types do not exist.
Step 3: Write minimal implementation
- Extend `Subtitle` in `src/types.ts` with `speakerId`, `words`, and `confidence`.
- Create a pure helper module that normalizes backend payloads into frontend subtitles.
export const deriveSubtitleBounds = (words: WordTiming[]) => ({
startTime: words[0]?.startTime ?? 0,
endTime: words[words.length - 1]?.endTime ?? 0,
});
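A fuller sketch of `normalizeAlignedSentence` built on that bounds helper, for orientation only; the `WordTiming` and `AlignedSentence` shapes and the minimum-confidence rule are assumptions for illustration, not settled API:

```typescript
// Assumed shapes; the real fields belong in src/types.ts.
interface WordTiming {
  text: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  confidence: number;
}

interface AlignedSentence {
  id: string;
  speakerId: string;
  words: WordTiming[];
  originalText: string;
  translatedText: string;
}

const deriveSubtitleBounds = (words: WordTiming[]) => ({
  startTime: words[0]?.startTime ?? 0,
  endTime: words[words.length - 1]?.endTime ?? 0,
});

// Sentence confidence here is the minimum word confidence, so a single
// weak word flags the whole subtitle for review (an assumed policy).
export const normalizeAlignedSentence = (sentence: AlignedSentence) => ({
  ...sentence,
  ...deriveSubtitleBounds(sentence.words),
  confidence: sentence.words.reduce(
    (min, word) => Math.min(min, word.confidence),
    1,
  ),
});
```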
Step 4: Run test to verify it passes
Run: npm test -- --run src/lib/subtitlePipeline.test.ts
Expected: PASS.
Step 5: Commit
git add src/types.ts src/lib/subtitlePipeline.ts src/lib/subtitlePipeline.test.ts
git commit -m "feat: add subtitle pipeline normalizers"
Task 3: Implement Sentence Reconstruction Helpers
Files:
- Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\alignment\sentenceReconstruction.ts
- Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\alignment\sentenceReconstruction.test.ts
Step 1: Write the failing test
Cover pause splitting and speaker splitting.
it('splits sentences when speaker changes', () => {
const result = rebuildSentences([
{ text: 'Hi', startTime: 0.0, endTime: 0.2, speakerId: 'spk_0', confidence: 0.9 },
{ text: 'there', startTime: 0.25, endTime: 0.5, speakerId: 'spk_0', confidence: 0.9 },
{ text: 'no', startTime: 0.55, endTime: 0.7, speakerId: 'spk_1', confidence: 0.9 },
]);
expect(result).toHaveLength(2);
});
Step 2: Run test to verify it fails
Run: npm test -- --run src/lib/alignment/sentenceReconstruction.test.ts
Expected: FAIL because the helper module is missing.
Step 3: Write minimal implementation
Implement pure splitting rules:
- Split on `speakerId` change.
- Split when the gap between consecutive words exceeds 0.45 seconds.
- Split when the sentence duration exceeds 8 seconds.
if (nextWord.speakerId !== currentSpeakerId) {
flushSentence();
}
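The three rules can be combined into a single pure pass; this sketch assumes the `WordTiming` shape from Task 2 and restates the thresholds as named constants:

```typescript
interface WordTiming {
  text: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  confidence: number;
}

const MAX_WORD_GAP_SECONDS = 0.45;
const MAX_SENTENCE_DURATION_SECONDS = 8;

// Greedy single pass: start a new sentence whenever any split rule fires.
export const rebuildSentences = (words: WordTiming[]): WordTiming[][] => {
  const sentences: WordTiming[][] = [];
  let current: WordTiming[] = [];

  for (const word of words) {
    const previous = current[current.length - 1];
    const shouldSplit =
      previous !== undefined &&
      (word.speakerId !== previous.speakerId ||
        word.startTime - previous.endTime > MAX_WORD_GAP_SECONDS ||
        word.endTime - current[0].startTime > MAX_SENTENCE_DURATION_SECONDS);

    if (shouldSplit) {
      sentences.push(current);
      current = [];
    }
    current.push(word);
  }
  if (current.length > 0) sentences.push(current);
  return sentences;
};
```

Keeping the function a pure `WordTiming[] -> WordTiming[][]` mapping makes the threshold rules trivial to test in isolation.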
Step 4: Run test to verify it passes
Run: npm test -- --run src/lib/alignment/sentenceReconstruction.test.ts
Expected: PASS.
Step 5: Commit
git add src/lib/alignment/sentenceReconstruction.ts src/lib/alignment/sentenceReconstruction.test.ts
git commit -m "feat: add sentence reconstruction rules"
Task 4: Implement Speaker Assignment Helpers
Files:
- Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\alignment\speakerAssignment.ts
- Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\alignment\speakerAssignment.test.ts
Step 1: Write the failing test
Test overlap-based speaker assignment.
it('assigns each word to the speaker segment with maximum overlap', () => {
const word = { text: 'hello', startTime: 1.0, endTime: 1.4 };
const speakers = [
{ speakerId: 'spk_0', startTime: 0.8, endTime: 1.1 },
{ speakerId: 'spk_1', startTime: 1.1, endTime: 1.6 },
];
expect(assignSpeakerToWord(word, speakers)).toBe('spk_1');
});
Step 2: Run test to verify it fails
Run: npm test -- --run src/lib/alignment/speakerAssignment.test.ts
Expected: FAIL because speaker assignment logic does not exist.
Step 3: Write minimal implementation
Add a pure overlap calculator and default to unknown when no segment overlaps.
const overlap = Math.max(
0,
Math.min(word.endTime, segment.endTime) - Math.max(word.startTime, segment.startTime),
);
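A complete sketch of `assignSpeakerToWord` built around that overlap calculation; the `'unknown'` fallback id carries over the default named in Step 3, and the interface shapes are assumptions:

```typescript
interface SpeakerSegment {
  speakerId: string;
  startTime: number;
  endTime: number;
}

interface TimedSpan {
  startTime: number;
  endTime: number;
}

// Pick the diarization segment with the largest temporal overlap;
// fall back to 'unknown' when no segment overlaps the word at all.
export const assignSpeakerToWord = (
  word: TimedSpan,
  segments: SpeakerSegment[],
): string => {
  let bestSpeakerId = 'unknown';
  let bestOverlap = 0;

  for (const segment of segments) {
    const overlap = Math.max(
      0,
      Math.min(word.endTime, segment.endTime) -
        Math.max(word.startTime, segment.startTime),
    );
    if (overlap > bestOverlap) {
      bestOverlap = overlap;
      bestSpeakerId = segment.speakerId;
    }
  }
  return bestSpeakerId;
};
```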
Step 4: Run test to verify it passes
Run: npm test -- --run src/lib/alignment/speakerAssignment.test.ts
Expected: PASS.
Step 5: Commit
git add src/lib/alignment/speakerAssignment.ts src/lib/alignment/speakerAssignment.test.ts
git commit -m "feat: add speaker assignment helpers"
Task 5: Isolate Backend Pipeline Logic from server.ts
Files:
- Create: E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitlePipeline.ts
- Create: E:\Downloads\ai-video-dubbing-&-translation\src\server\subtitlePipeline.test.ts
- Modify: E:\Downloads\ai-video-dubbing-&-translation\server.ts
Step 1: Write the failing test
Add tests for orchestration-level fallback behavior.
it('returns partial quality when diarization is unavailable', async () => {
const result = await buildSubtitlePayload({
alignmentResult: {
words: [{ text: 'hi', startTime: 0, endTime: 0.2, speakerId: 'unknown', confidence: 0.9 }],
speakerSegments: [],
quality: 'partial',
},
});
expect(result.quality).toBe('partial');
});
Step 2: Run test to verify it fails
Run: npm test -- --run src/server/subtitlePipeline.test.ts
Expected: FAIL because orchestration code is still embedded in server.ts.
Step 3: Write minimal implementation
- Move payload-building logic into `src/server/subtitlePipeline.ts`.
- Make `server.ts` call the helper and only handle HTTP concerns.
export const buildSubtitlePayload = async (deps: SubtitlePipelineDeps) => {
// normalize alignment result
// translate text
// return { subtitles, speakers, quality, ... }
};
Step 4: Run test to verify it passes
Run: npm test -- --run src/server/subtitlePipeline.test.ts
Expected: PASS.
Step 5: Commit
git add src/server/subtitlePipeline.ts src/server/subtitlePipeline.test.ts server.ts
git commit -m "refactor: isolate subtitle pipeline orchestration"
Task 6: Add an Alignment Service Adapter
Files:
- Create: E:\Downloads\ai-video-dubbing-&-translation\src\server\alignmentAdapter.ts
- Create: E:\Downloads\ai-video-dubbing-&-translation\src\server\alignmentAdapter.test.ts
- Modify: E:\Downloads\ai-video-dubbing-&-translation\server.ts
Step 1: Write the failing test
Test that the adapter maps raw alignment responses into normalized internal types.
it('maps aligned words and speaker segments from the adapter response', async () => {
const result = await parseAlignmentResponse({
words: [{ word: 'hello', start: 1.0, end: 1.2, speaker: 'spk_0', score: 0.95 }],
speakers: [{ speaker: 'spk_0', start: 0.8, end: 1.6 }],
});
expect(result.words[0].speakerId).toBe('spk_0');
});
Step 2: Run test to verify it fails
Run: npm test -- --run src/server/alignmentAdapter.test.ts
Expected: FAIL because no adapter exists.
Step 3: Write minimal implementation
Create an adapter boundary with one public function such as `requestAlignedTranscript(audioPath)`.
export const requestAlignedTranscript = async (audioPath: string) => {
// call local or remote alignment backend
// normalize response shape
};
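The pure mapping half, `parseAlignmentResponse` from Step 1, could look like the sketch below. The raw field names (`word`, `start`, `end`, `speaker`, `score`) mirror the test fixture and are assumptions about the alignment backend, not a confirmed contract:

```typescript
// Assumed raw response shapes from the alignment backend.
interface RawWord {
  word: string;
  start: number;
  end: number;
  speaker?: string;
  score?: number;
}

interface RawSpeakerSegment {
  speaker: string;
  start: number;
  end: number;
}

// Normalize backend naming into the internal WordTiming/SpeakerSegment
// vocabulary, defaulting missing speakers and scores conservatively.
export const parseAlignmentResponse = (raw: {
  words: RawWord[];
  speakers: RawSpeakerSegment[];
}) => ({
  words: raw.words.map((w) => ({
    text: w.word,
    startTime: w.start,
    endTime: w.end,
    speakerId: w.speaker ?? 'unknown',
    confidence: w.score ?? 0,
  })),
  speakerSegments: raw.speakers.map((s) => ({
    speakerId: s.speaker,
    startTime: s.start,
    endTime: s.end,
  })),
});
```

Keeping the HTTP call and this mapper separate lets the mapper be tested without any network stubbing.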
Step 4: Run test to verify it passes
Run: npm test -- --run src/server/alignmentAdapter.test.ts
Expected: PASS.
Step 5: Commit
git add src/server/alignmentAdapter.ts src/server/alignmentAdapter.test.ts server.ts
git commit -m "feat: add alignment service adapter"
Task 7: Upgrade /api/process-audio-pipeline Response Shape
Files:
- Modify: E:\Downloads\ai-video-dubbing-&-translation\server.ts
- Modify: E:\Downloads\ai-video-dubbing-&-translation\src\services\geminiService.ts
- Create: E:\Downloads\ai-video-dubbing-&-translation\src\services\geminiService.test.ts
Step 1: Write the failing test
Add a client-side test for parsing quality, speakers, and words.
it('maps the enriched audio pipeline response into subtitle objects', async () => {
const payload = {
subtitles: [
{
id: 'sub_1',
startTime: 1,
endTime: 2,
originalText: 'Hello',
translatedText: '你好',
speaker: 'Speaker 1',
speakerId: 'spk_0',
words: [{ text: 'Hello', startTime: 1, endTime: 2, speakerId: 'spk_0', confidence: 0.9 }],
confidence: 0.9,
},
],
speakers: [{ speakerId: 'spk_0', label: 'Speaker 1' }],
quality: 'full',
};
expect(mapPipelineResponse(payload).subtitles[0].words).toHaveLength(1);
});
Step 2: Run test to verify it fails
Run: npm test -- --run src/services/geminiService.test.ts
Expected: FAIL because the mapping helper does not exist.
Step 3: Write minimal implementation
- Add a response-mapping helper in `src/services/geminiService.ts`.
- Preserve the existing fallback path.
- Carry `quality` metadata through to the UI.
const quality = data.quality ?? 'fallback';
const subtitles = (data.subtitles ?? []).map(mapSubtitleFromApi);
Step 4: Run test to verify it passes
Run: npm test -- --run src/services/geminiService.test.ts
Expected: PASS.
Step 5: Commit
git add server.ts src/services/geminiService.ts src/services/geminiService.test.ts
git commit -m "feat: return enriched subtitle pipeline payloads"
Task 8: Add Precision Metadata to Editor State
Files:
- Modify: E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.tsx
- Create: E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.test.tsx
Step 1: Write the failing test
Add a test for rendering a fallback warning when quality is low.
it('shows a low-precision notice for fallback subtitle results', () => {
render(<EditorScreen ... />);
expect(screen.getByText(/low-precision/i)).toBeInTheDocument();
});
Step 2: Run test to verify it fails
Run: npm test -- --run src/components/EditorScreen.test.tsx
Expected: FAIL because the component does not track pipeline quality yet.
Step 3: Write minimal implementation
- Add state for `quality` and `speakers`.
- Surface a small status badge or warning banner.
- Keep the existing sentence list and timeline intact.
{quality === 'fallback' && (
<p className="text-xs text-amber-700">Low-precision timing detected. Manual review recommended.</p>
)}
Step 4: Run test to verify it passes
Run: npm test -- --run src/components/EditorScreen.test.tsx
Expected: PASS.
Step 5: Commit
git add src/components/EditorScreen.tsx src/components/EditorScreen.test.tsx
git commit -m "feat: surface subtitle precision status in editor"
Task 9: Add Word-Level Playback Helpers
Files:
- Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\playback\wordHighlight.ts
- Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\playback\wordHighlight.test.ts
- Modify: E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.tsx
Step 1: Write the failing test
Test the active-word lookup helper.
it('returns the active word for the current playback time', () => {
const activeWord = getActiveWord([
{ text: 'Hello', startTime: 1, endTime: 1.3, speakerId: 'spk_0', confidence: 0.9 },
], 1.1);
expect(activeWord?.text).toBe('Hello');
});
Step 2: Run test to verify it fails
Run: npm test -- --run src/lib/playback/wordHighlight.test.ts
Expected: FAIL because playback helpers do not exist.
Step 3: Write minimal implementation
- Create a pure helper for active-word lookup.
- Use it in `EditorScreen.tsx` to render highlighted word spans when `words` are present.
export const getActiveWord = (words: WordTiming[], currentTime: number) =>
words.find((word) => currentTime >= word.startTime && currentTime <= word.endTime);
Step 4: Run test to verify it passes
Run: npm test -- --run src/lib/playback/wordHighlight.test.ts
Expected: PASS.
Step 5: Commit
git add src/lib/playback/wordHighlight.ts src/lib/playback/wordHighlight.test.ts src/components/EditorScreen.tsx
git commit -m "feat: add word-level playback highlighting"
Task 10: Snap Timeline Edges to Word Boundaries
Files:
- Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\timeline\snapToWords.ts
- Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\timeline\snapToWords.test.ts
- Modify: E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.tsx
Step 1: Write the failing test
Test snapping to nearest word edges.
it('snaps a dragged start edge to the nearest word boundary', () => {
const next = snapTimeToNearestWordBoundary(
1.34,
[
{ text: 'Hello', startTime: 1.0, endTime: 1.3, speakerId: 'spk_0', confidence: 0.9 },
{ text: 'world', startTime: 1.35, endTime: 1.8, speakerId: 'spk_0', confidence: 0.9 },
],
);
expect(next).toBe(1.35);
});
Step 2: Run test to verify it fails
Run: npm test -- --run src/lib/timeline/snapToWords.test.ts
Expected: FAIL because no snapping helper exists.
Step 3: Write minimal implementation
- Add a pure snapping helper with a small tolerance window.
- Use it in the left and right resize timeline handlers.
export const snapTimeToNearestWordBoundary = (time: number, words: WordTiming[]) => {
// choose nearest start or end boundary within tolerance
};
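One way to fill in that helper is a linear scan over word boundaries; the 0.15-second tolerance is a tuning assumption, not a value fixed by this plan:

```typescript
interface WordTiming {
  text: string;
  startTime: number;
  endTime: number;
  speakerId: string;
  confidence: number;
}

// Assumed tolerance window; expose it as a parameter if handlers need
// different snap strengths at different zoom levels.
const SNAP_TOLERANCE_SECONDS = 0.15;

// Snap to the closest word start or end; return the raw time unchanged
// when no boundary lies within the tolerance window.
export const snapTimeToNearestWordBoundary = (
  time: number,
  words: WordTiming[],
): number => {
  const boundaries = words.flatMap((word) => [word.startTime, word.endTime]);
  let best = time;
  let bestDistance = SNAP_TOLERANCE_SECONDS;

  for (const boundary of boundaries) {
    const distance = Math.abs(boundary - time);
    if (distance <= bestDistance) {
      best = boundary;
      bestDistance = distance;
    }
  }
  return best;
};
```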
Step 4: Run test to verify it passes
Run: npm test -- --run src/lib/timeline/snapToWords.test.ts
Expected: PASS.
Step 5: Commit
git add src/lib/timeline/snapToWords.ts src/lib/timeline/snapToWords.test.ts src/components/EditorScreen.tsx
git commit -m "feat: snap subtitle edits to word boundaries"
Task 11: Add Speaker-Aware UI State
Files:
- Modify: E:\Downloads\ai-video-dubbing-&-translation\src\components\EditorScreen.tsx
- Modify: E:\Downloads\ai-video-dubbing-&-translation\src\voices.ts
- Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\speakers\speakerPresentation.ts
- Create: E:\Downloads\ai-video-dubbing-&-translation\src\lib\speakers\speakerPresentation.test.ts
Step 1: Write the failing test
Test stable color and label generation for speaker tracks.
it('creates stable display metadata for each speaker id', () => {
const speaker = buildSpeakerPresentation({ speakerId: 'spk_0', label: 'Speaker 1' });
expect(speaker.color).toMatch(/^#/);
});
Step 2: Run test to verify it fails
Run: npm test -- --run src/lib/speakers/speakerPresentation.test.ts
Expected: FAIL because no speaker presentation helper exists.
Step 3: Write minimal implementation
- Create a helper that derives a display color and fallback label from `speakerId`.
- Use it to color sentence chips or timeline items.
- Keep voice assignment behavior backward compatible.
export const buildSpeakerPresentation = ({ speakerId, label }: SpeakerTrack) => ({
speakerId,
label,
color: '#1677ff',
});
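To make the color genuinely stable per `speakerId` rather than hardcoded, one option is a hashed lookup into a fixed hex palette (which also satisfies the `/^#/` assertion in Step 1); the palette values and hash function here are illustrative:

```typescript
interface SpeakerTrack {
  speakerId: string;
  label?: string;
}

// Illustrative palette; any set of distinct hex colors works.
const SPEAKER_COLORS = ['#1677ff', '#fa541c', '#52c41a', '#722ed1', '#eb2f96', '#13c2c2'];

// Simple deterministic string hash so the same speakerId always maps to
// the same palette entry across sessions, with no stored mapping.
const hashId = (id: string): number => {
  let hash = 0;
  for (let i = 0; i < id.length; i += 1) {
    hash = (hash * 31 + id.charCodeAt(i)) >>> 0;
  }
  return hash;
};

export const buildSpeakerPresentation = ({ speakerId, label }: SpeakerTrack) => ({
  speakerId,
  label: label ?? speakerId,
  color: SPEAKER_COLORS[hashId(speakerId) % SPEAKER_COLORS.length],
});
```

Two speakers can still collide on a color once the palette is exhausted; cycling hues or indexing by enumeration order are alternatives if that matters.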
Step 4: Run test to verify it passes
Run: npm test -- --run src/lib/speakers/speakerPresentation.test.ts
Expected: PASS.
Step 5: Commit
git add src/components/EditorScreen.tsx src/voices.ts src/lib/speakers/speakerPresentation.ts src/lib/speakers/speakerPresentation.test.ts
git commit -m "feat: add speaker-aware editor presentation"
Task 12: Verify End-to-End Behavior and Update Docs
Files:
- Modify: E:\Downloads\ai-video-dubbing-&-translation\README.md
- Modify: E:\Downloads\ai-video-dubbing-&-translation\docs\plans\2026-03-17-precise-dialogue-localization-design.md
Step 1: Write the failing test
Write down the manual verification checklist before changing docs so the release criteria are explicit.
- [ ] Single-speaker clip returns `quality: full`
- [ ] Two-speaker clip shows distinct speaker IDs
- [ ] Fallback path shows low-precision notice
- [ ] Timeline resize snaps to word boundaries
Step 2: Run test to verify it fails
Run: npm run lint
Expected: PASS or FAIL depending on in-progress code, but manual verification is still incomplete until the checklist is executed.
Step 3: Write minimal implementation
- Update `README.md` with the new environment requirements and pipeline description.
- Record the manual verification results in the design document or a linked note.
## High-Precision Subtitle Mode
Set the alignment backend environment variables before running the app.
Step 4: Run test to verify it passes
Run: npm test -- --run
Expected: PASS.
Run: npm run lint
Expected: PASS.
Run: npm run build
Expected: PASS.
Step 5: Commit
git add README.md docs/plans/2026-03-17-precise-dialogue-localization-design.md
git commit -m "docs: document precise dialogue localization workflow"
Notes for Execution
- This workspace currently has no `.git` directory, so the commit steps cannot run until the project is placed in a real Git checkout.
- Introduce the alignment backend behind environment-based configuration so existing demos can still use the current fallback path.
- Prefer pure functions for sentence reconstruction, speaker assignment, snapping, and word-highlighting logic so they remain easy to test.