video_translate/docs/plans/2026-03-17-doubao-llm-provider-design.md

# Doubao LLM Provider Design

**Date:** 2026-03-17

**Goal:** Add a user-visible LLM switcher so subtitle generation can use either Doubao or Gemini, default to Doubao, and keep TTS fixed on MiniMax.

## Current State

The current project is effectively Gemini-only for subtitle generation and translation.

1. `src/services/geminiService.ts` calls Gemini directly from the browser for subtitle generation and Gemini fallback TTS.
2. `src/server/geminiTranslation.ts` translates sentence text on the server with Gemini.
3. `src/server/audioPipelineConfig.ts` only validates `GEMINI_API_KEY`.
4. `src/components/EditorScreen.tsx` imports a Gemini-specific service and has no model selector.
5. MiniMax is already independent and used only for TTS through `/api/tts`.

This makes provider switching hard because the LLM choice is not isolated behind a shared contract.

## Product Requirements

1. The editor must show a visible LLM selector.
2. Available LLM options are `Doubao` and `Gemini`.
3. The default LLM must be `Doubao`.
4. TTS must remain fixed to MiniMax and must not participate in provider switching.
5. API keys must only come from `.env`.
6. The app must not silently fall back from one LLM provider to the other.

## Chosen Approach

Use a server-side provider abstraction for subtitle generation and translation, with a frontend selector that passes the chosen provider to the server.

This approach keeps secrets on the server, avoids browser-side provider drift, and gives the project one place to add or change LLM providers later.

## Why This Approach

### Option A: Server-side provider abstraction with frontend selector

Recommended.

1. Frontend sends `provider: 'doubao' | 'gemini'`.
2. Server reads the matching API key from `.env`.
3. Server routes subtitle text generation through a provider adapter.
4. Time-critical audio extraction and timeline logic stay outside the provider-specific layer.

Pros:

1. Keeps API keys off the client.
2. Produces one consistent API contract for the editor.
3. Makes default-provider behavior easy to enforce.
4. Prevents Gemini-specific code from leaking further into the app.

Cons:

1. Requires moving browser-side subtitle generation behavior into a server-owned path.
2. Touches both frontend and backend.

### Option B: Keep Gemini in the browser and add Doubao as a separate server path

Rejected.

Pros:

1. Faster initial implementation.

Cons:

1. Two subtitle-generation architectures would coexist.
2. Provider behavior would drift over time.
3. It violates the requirement that keys come only from `.env`.

### Option C: Client-side provider switching

Rejected.

Pros:

1. Minimal backend work.

Cons:

1. Exposes secrets to the browser.
2. Conflicts with the `.env`-only requirement.

## Architecture

### Frontend

The editor adds an `LLM` selector with the values:

1. `Doubao`
2. `Gemini`

The default selected value is `Doubao`.

When the user clicks subtitle generation, the frontend sends:

1. the uploaded video
2. the target language
3. the selected LLM provider
4. optional trim metadata if the current flow needs it

The frontend no longer needs to know how Gemini or Doubao are called. It only consumes a normalized subtitle payload.

### Server

The server becomes the single owner of LLM subtitle generation.

Responsibilities:

1. validate the incoming provider
2. read provider credentials from `.env`
3. extract audio and prepare subtitle-generation inputs
4. call the chosen provider adapter
5. normalize the result into the existing subtitle shape

### Provider Layer

Create a provider abstraction around LLM calls:

1. `resolveLlmProvider(provider, env)`
2. `geminiProvider`
3. `doubaoProvider`

Each provider must accept the same logical input and return the same logical output so the rest of the app is provider-agnostic.
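
The contract above could be sketched as follows. This is a minimal illustration, not the project's implementation: only `resolveLlmProvider`, `geminiProvider`, and `doubaoProvider` come from this design, while `LlmProvider`, `LlmRequest`, `LlmResult`, and `apiKeyEnvVar` are hypothetical names, and the adapters are stubs.

```typescript
// Sketch only. Type and field names other than resolveLlmProvider,
// geminiProvider and doubaoProvider are hypothetical.
type LlmProviderName = 'doubao' | 'gemini';
type Env = Record<string, string | undefined>;

interface LlmRequest {
  targetLanguage: string;
  sentences: string[];
}

interface LlmResult {
  translations: string[];
}

interface LlmProvider {
  readonly name: LlmProviderName;
  readonly apiKeyEnvVar: 'ARK_API_KEY' | 'GEMINI_API_KEY';
  generate(request: LlmRequest, apiKey: string): Promise<LlmResult>;
}

// Stub adapters: the real network calls are out of scope for this sketch.
const doubaoProvider: LlmProvider = {
  name: 'doubao',
  apiKeyEnvVar: 'ARK_API_KEY',
  async generate() {
    throw new Error('Doubao adapter is not implemented in this sketch');
  },
};

const geminiProvider: LlmProvider = {
  name: 'gemini',
  apiKeyEnvVar: 'GEMINI_API_KEY',
  async generate() {
    throw new Error('Gemini adapter is not implemented in this sketch');
  },
};

const providers: Record<LlmProviderName, LlmProvider> = {
  doubao: doubaoProvider,
  gemini: geminiProvider,
};

function isLlmProviderName(value: string): value is LlmProviderName {
  return value === 'doubao' || value === 'gemini';
}

// Resolves the adapter and its own key. There is no cross-provider fallback:
// an unknown name or a missing key is an error.
function resolveLlmProvider(
  provider: string | undefined,
  env: Env,
): { provider: LlmProvider; apiKey: string } {
  const name = provider ?? 'doubao';
  if (!isLlmProviderName(name)) {
    throw new Error(`Unsupported LLM provider: ${name}`);
  }
  const selected = providers[name];
  const apiKey = env[selected.apiKeyEnvVar];
  if (!apiKey) {
    throw new Error(`${selected.apiKeyEnvVar} is not set for provider "${name}"`);
  }
  return { provider: selected, apiKey };
}
```

Passing `env` explicitly, instead of reading `process.env` inside the function, keeps provider resolution a deterministic seam that can be tested without touching real configuration.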

## API Design

Add a dedicated subtitle-generation endpoint rather than overloading the existing audio-extraction endpoint.

### Request

`POST /api/generate-subtitles`

Multipart or JSON payload fields:

1. `video`
2. `targetLanguage`
3. `provider`
4. optional `trimRange`
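
As an illustration of the request, the editor could build the payload roughly like this. The sketch assumes the multipart variant, and the helper name `buildGenerateSubtitlesForm`, the `TrimRange` shape, and the JSON encoding of `trimRange` are all assumptions rather than settled design.

```typescript
type LlmProviderName = 'doubao' | 'gemini';

// Hypothetical shape; the real trim metadata may differ.
interface TrimRange {
  start: number;
  end: number;
}

interface GenerateSubtitlesParams {
  video: Blob;
  fileName: string;
  targetLanguage: string;
  provider?: LlmProviderName;
  trimRange?: TrimRange;
}

// Builds the multipart body for POST /api/generate-subtitles.
function buildGenerateSubtitlesForm(params: GenerateSubtitlesParams): FormData {
  const form = new FormData();
  form.append('video', params.video, params.fileName);
  form.append('targetLanguage', params.targetLanguage);
  // The editor default is Doubao when nothing else was selected.
  form.append('provider', params.provider ?? 'doubao');
  if (params.trimRange) {
    // Assumption: trim metadata travels as a JSON string field.
    form.append('trimRange', JSON.stringify(params.trimRange));
  }
  return form;
}

// Usage in the editor (not executed here):
//   const body = buildGenerateSubtitlesForm({ video, fileName, targetLanguage });
//   await fetch('/api/generate-subtitles', { method: 'POST', body });
```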

### Response

Return the same normalized subtitle structure the editor already understands.

At minimum each subtitle object should include:

1. `id`
2. `startTime`
3. `endTime`
4. `originalText`
5. `translatedText`
6. `speaker`
7. `voiceId`
8. `volume`

If richer timeline metadata already exists in the current server subtitle pipeline, keep it in the response rather than trimming it away.
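
One way to pin down the minimum shape is a type plus a runtime guard, as sketched below. The field names come from the list above; the field types (string ids, numeric times and volume) are assumptions, since this design does not specify them, and `SubtitleEntry` and `isSubtitleEntry` are hypothetical names.

```typescript
// Field types are assumed, not taken from the existing editor code.
interface SubtitleEntry {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
  translatedText: string;
  speaker: string;
  voiceId: string;
  volume: number;
  // Richer timeline metadata is allowed to pass through untouched.
  [extra: string]: unknown;
}

const STRING_FIELDS = ['id', 'originalText', 'translatedText', 'speaker', 'voiceId'];
const NUMBER_FIELDS = ['startTime', 'endTime', 'volume'];

// Checks only the minimum fields; extra keys are ignored, not rejected.
function isSubtitleEntry(value: unknown): value is SubtitleEntry {
  if (typeof value !== 'object' || value === null) {
    return false;
  }
  const record = value as Record<string, unknown>;
  const stringsOk = STRING_FIELDS.every((key) => typeof record[key] === 'string');
  const numbersOk = NUMBER_FIELDS.every((key) => Number.isFinite(record[key]));
  if (!stringsOk || !numbersOk) {
    return false;
  }
  return (record.endTime as number) >= (record.startTime as number);
}
```

A guard like this could run on the server before responding, so that both providers are held to the same output shape.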

## Subtitle Generation Strategy

The provider switch should affect only LLM reasoning. It must not change TTS or any part of the MiniMax path.

The cleanest boundary is:

1. audio extraction and timeline preparation stay on the server
2. LLM provider handles translation and label generation
3. MiniMax remains the only TTS engine

This reduces the risk that switching providers changes subtitle timing behavior unpredictably.
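
A sketch of that boundary, assuming segments are matched by `id`: timing fields are always taken from the server-prepared timeline, and the provider output contributes text and labels only. `TimelineSegment`, `LlmSegmentOutput`, and `mergeLlmOutput` are hypothetical names.

```typescript
// Produced by server-side audio extraction and timeline preparation.
interface TimelineSegment {
  id: string;
  startTime: number;
  endTime: number;
  originalText: string;
}

// What a provider adapter is allowed to contribute.
interface LlmSegmentOutput {
  id: string;
  translatedText: string;
  speaker: string;
}

// Timing always comes from the timeline, never from the provider.
function mergeLlmOutput(
  timeline: TimelineSegment[],
  llmOutput: LlmSegmentOutput[],
): Array<TimelineSegment & { translatedText: string; speaker: string }> {
  const byId = new Map(llmOutput.map((item) => [item.id, item] as const));
  return timeline.map((segment) => {
    const llm = byId.get(segment.id);
    if (!llm) {
      throw new Error(`Provider returned no output for segment ${segment.id}`);
    }
    return {
      ...segment,
      translatedText: llm.translatedText,
      speaker: llm.speaker,
    };
  });
}
```

Because the merge copies only `translatedText` and `speaker`, any timing a provider might return is ignored by construction.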

## Doubao Integration Notes

Use the Ark Responses API on the server:

1. endpoint: `https://ark.cn-beijing.volces.com/api/v3/responses`
2. auth: `Authorization: Bearer ${ARK_API_KEY}`
3. model: configurable, defaulting to `doubao-seed-2-0-pro-260215`

The provider should treat Doubao as a text-generation backend and extract normalized text from the response payload before JSON parsing.

Implementation detail:

1. the response parser should not assume SDK-specific helpers
2. it should read the returned response envelope and collect the textual output fragments
3. the final result should be parsed as JSON only after the output text is reconstructed

This parsing approach is inferred from the official Ark Responses API response shape and is meant to keep the parser resilient to wrapper differences.
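
The parsing steps might look like the sketch below. It assumes the envelope carries an `output` array whose `message` items hold `output_text` content parts; that shape is an assumption to verify against the Ark documentation, and `collectOutputText` and `parseDoubaoJson` are hypothetical names.

```typescript
// Assumed envelope shape; verify against the Ark Responses API reference.
interface ArkContentPart {
  type?: string;
  text?: string;
}

interface ArkOutputItem {
  type?: string;
  content?: ArkContentPart[];
}

interface ArkResponseEnvelope {
  output?: ArkOutputItem[];
}

// Collects textual fragments in order, skipping non-message items.
function collectOutputText(envelope: unknown): string {
  const output = (envelope as ArkResponseEnvelope | null | undefined)?.output;
  if (!Array.isArray(output)) {
    return '';
  }
  const fragments: string[] = [];
  for (const item of output) {
    if (item?.type !== 'message' || !Array.isArray(item.content)) {
      continue;
    }
    for (const part of item.content) {
      if (part?.type === 'output_text' && typeof part.text === 'string') {
        fragments.push(part.text);
      }
    }
  }
  return fragments.join('');
}

// JSON parsing happens only after the full output text is reconstructed.
function parseDoubaoJson(envelope: unknown): unknown {
  const text = collectOutputText(envelope).trim();
  if (text.length === 0) {
    throw new Error('Doubao response contained no output text');
  }
  return JSON.parse(text);
}
```

Reading plain fields from the envelope, with no SDK helper, means the same parser works whether the response arrives through `fetch` or through a wrapper library.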

## Configuration

Environment variables:

1. `ARK_API_KEY` for Doubao
2. `GEMINI_API_KEY` for Gemini
3. `MINIMAX_API_KEY` for TTS
4. optional `DOUBAO_MODEL` for server-side model override
5. optional `DEFAULT_LLM_PROVIDER` with a default value of `doubao`

Rules:

1. No API keys may be embedded in frontend code.
2. No provider may silently reuse another provider's key.
3. If the selected provider is missing its key, return a clear error.
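
The optional variables could be read with explicit defaults, as in this sketch. `readLlmConfig` and `LlmConfig` are hypothetical names, and rejecting an unrecognized `DEFAULT_LLM_PROVIDER` value (rather than ignoring it) is an assumption that follows the no-silent-fallback rule.

```typescript
type Env = Record<string, string | undefined>;

interface LlmConfig {
  defaultProvider: 'doubao' | 'gemini';
  doubaoModel: string;
}

// Reads optional settings only; API keys are looked up per provider elsewhere.
function readLlmConfig(env: Env): LlmConfig {
  const rawDefault = env.DEFAULT_LLM_PROVIDER?.trim() || 'doubao';
  const defaultProvider =
    rawDefault === 'doubao' ? 'doubao' : rawDefault === 'gemini' ? 'gemini' : null;
  if (defaultProvider === null) {
    // Assumption: a bad value is a configuration error, not a silent default.
    throw new Error(`DEFAULT_LLM_PROVIDER must be "doubao" or "gemini", got "${rawDefault}"`);
  }
  return {
    defaultProvider,
    doubaoModel: env.DOUBAO_MODEL?.trim() || 'doubao-seed-2-0-pro-260215',
  };
}
```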

## Error Handling

Provider failures must be explicit.

1. If `provider` is invalid, return `400`.
2. If the selected provider key is missing, return `400`.
3. If the selected provider returns an auth failure, return `401` or a mapped upstream auth error.
4. If the selected provider fails unexpectedly, return `502` or `500` with a provider-specific error message.
5. Do not auto-fallback from Doubao to Gemini or from Gemini to Doubao.

The UI should show which provider failed so the user is never misled about which model generated a subtitle result.
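
A possible mapping from failure kinds to responses is sketched below. Where the rules above allow two codes, the sketch picks `401` for auth failures and `502` for unexpected upstream failures; that choice, along with the names `ProviderFailure`, `FailureKind`, and `toHttpError`, is an assumption.

```typescript
type FailureKind = 'invalid_provider' | 'missing_key' | 'auth_failed' | 'upstream_failed';

interface ProviderFailure {
  provider: string;
  kind: FailureKind;
  detail: string;
}

// One status per failure kind; no kind triggers a fallback to another provider.
const STATUS_BY_KIND: Record<FailureKind, number> = {
  invalid_provider: 400,
  missing_key: 400,
  auth_failed: 401,
  upstream_failed: 502,
};

interface HttpError {
  status: number;
  body: { provider: string; error: string };
}

// The provider name is always part of the body so the UI can display it.
function toHttpError(failure: ProviderFailure): HttpError {
  return {
    status: STATUS_BY_KIND[failure.kind],
    body: {
      provider: failure.provider,
      error: `${failure.provider}: ${failure.detail}`,
    },
  };
}
```

For example, a missing Doubao key would map to status `400` with a body naming `doubao`, which the editor can surface next to the selector.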

## Frontend UX

Add the selector in the editor near the subtitle-generation controls so the choice is visible at generation time.

Rules:

1. Default selection is `Doubao`.
2. The selector affects each generation request immediately.
3. The selector does not affect previously generated subtitles until the user regenerates.
4. The selector does not affect MiniMax TTS generation.

## Testing Strategy

Coverage should focus on deterministic seams.

1. Provider resolution defaults to Doubao.
2. Invalid provider is rejected.
3. Missing `ARK_API_KEY` or `GEMINI_API_KEY` returns clear errors.
4. Doubao response parsing turns Ark response content into normalized subtitle JSON.
5. Gemini and Doubao providers both satisfy the same interface contract.
6. Editor defaults to Doubao and sends the selected provider on regenerate.
7. TTS behavior remains unchanged when the LLM provider changes.

## Rollout Notes

1. Introduce the new endpoint and provider abstraction first.
2. Switch the editor to the new endpoint second.
3. Keep MiniMax TTS untouched except for regression checks.
4. Leave any deeper visual fallback provider work for a later pass if needed.

## Constraints

1. This workspace is not a Git repository, so the design document cannot be committed here.
2. The user provided an Ark key in chat, but the implementation must still read provider secrets from `.env` and not hardcode them into source files.