Doubao LLM Provider Design

Date: 2026-03-17

Goal: Add a user-visible LLM switcher so subtitle generation can use either Doubao or Gemini, default to Doubao, and keep TTS fixed on MiniMax.

Current State

The current project is effectively Gemini-only for subtitle generation and translation.

  1. src/services/geminiService.ts calls Gemini directly from the browser, both for subtitle generation and for a Gemini-based TTS fallback.
  2. src/server/geminiTranslation.ts translates sentence text on the server with Gemini.
  3. src/server/audioPipelineConfig.ts only validates GEMINI_API_KEY.
  4. src/components/EditorScreen.tsx imports a Gemini-specific service and has no model selector.
  5. MiniMax is already independent and used only for TTS through /api/tts.

This makes provider switching hard because the LLM choice is not isolated behind a shared contract.

Product Requirements

  1. The editor must show a visible LLM selector.
  2. Available LLM options are Doubao and Gemini.
  3. The default LLM must be Doubao.
  4. TTS must remain fixed to MiniMax and must not participate in provider switching.
  5. API keys must only come from .env.
  6. The app must not silently fall back from one LLM provider to the other.

Chosen Approach

Use a server-side provider abstraction for subtitle generation and translation, with a frontend selector that passes the chosen provider to the server.

This approach keeps secrets on the server, avoids browser-side provider drift, and gives the project one place to add or change LLM providers later.

Why This Approach

Option A: Server-side provider abstraction with frontend selector

Recommended.

  1. Frontend sends provider: 'doubao' | 'gemini'.
  2. Server reads the matching API key from .env.
  3. Server routes subtitle text generation through a provider adapter.
  4. Time-critical audio extraction and timeline logic stay outside the provider-specific layer.

Pros:

  1. Keeps API keys off the client.
  2. Produces one consistent API contract for the editor.
  3. Makes default-provider behavior easy to enforce.
  4. Prevents Gemini-specific code from leaking further into the app.

Cons:

  1. Requires moving browser-side subtitle generation behavior into a server-owned path.
  2. Touches both frontend and backend.

Option B: Keep Gemini in the browser and add Doubao as a separate server path

Rejected.

Pros:

  1. Faster initial implementation.

Cons:

  1. Two subtitle-generation architectures would coexist.
  2. Provider behavior would drift over time.
  3. It violates the requirement that keys come only from .env.

Option C: Client-side provider switching

Rejected.

Pros:

  1. Minimal backend work.

Cons:

  1. Exposes secrets to the browser.
  2. Conflicts with the .env-only requirement.

Architecture

Frontend

The editor adds an LLM selector with the values:

  1. Doubao
  2. Gemini

The default selected value is Doubao.

When the user clicks subtitle generation, the frontend sends:

  1. the uploaded video
  2. the target language
  3. the selected LLM provider
  4. optional trim metadata if the current flow needs it

The frontend no longer needs to know how Gemini or Doubao are called. It only consumes a normalized subtitle payload.

Server

The server becomes the single owner of LLM subtitle generation.

Responsibilities:

  1. validate the incoming provider
  2. read provider credentials from .env
  3. extract audio and prepare subtitle-generation inputs
  4. call the chosen provider adapter
  5. normalize the result into the existing subtitle shape

Provider Layer

Create a provider abstraction around LLM calls:

  1. resolveLlmProvider(provider, env)
  2. geminiProvider
  3. doubaoProvider

Each provider must accept the same logical input and return the same logical output so the rest of the app is provider-agnostic.
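
The shared contract described above could look roughly like the following sketch. Only resolveLlmProvider, geminiProvider, and doubaoProvider are named by this design; the LlmRequest and LlmResult shapes, the translate method, and the stub adapter bodies are hypothetical placeholders, and resolving the default through DEFAULT_LLM_PROVIDER is one possible reading of the configuration section.

```typescript
// Sketch only: request/result shapes and adapter bodies are placeholders.
type LlmProviderName = 'doubao' | 'gemini';
type Env = Record<string, string | undefined>;

interface LlmRequest {
  sentences: string[];
  targetLanguage: string;
}

interface LlmResult {
  translations: string[];
}

interface LlmProvider {
  readonly name: LlmProviderName;
  translate(request: LlmRequest): Promise<LlmResult>;
}

// Placeholder adapters; real ones would call Ark and Gemini respectively.
const doubaoProvider: LlmProvider = {
  name: 'doubao',
  translate: async (request) => ({ translations: [...request.sentences] }),
};

const geminiProvider: LlmProvider = {
  name: 'gemini',
  translate: async (request) => ({ translations: [...request.sentences] }),
};

const PROVIDERS: Record<LlmProviderName, LlmProvider> = {
  doubao: doubaoProvider,
  gemini: geminiProvider,
};

function isLlmProviderName(value: unknown): value is LlmProviderName {
  return value === 'doubao' || value === 'gemini';
}

// An omitted provider falls back to DEFAULT_LLM_PROVIDER, then to 'doubao'.
// An unknown value is rejected rather than silently replaced.
function resolveLlmProvider(provider: unknown, env: Env): LlmProvider {
  const requested = provider ?? env['DEFAULT_LLM_PROVIDER'] ?? 'doubao';
  if (!isLlmProviderName(requested)) {
    throw new Error(`Unsupported LLM provider: ${String(requested)}`);
  }
  return PROVIDERS[requested];
}
```

Because both adapters are typed as LlmProvider, the compiler checks that they accept the same input and return the same output, which is the property the rest of the app relies on.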

API Design

Add a dedicated subtitle-generation endpoint rather than overloading the existing audio-extraction endpoint.

Request

POST /api/generate-subtitles

Multipart or JSON payload fields:

  1. video
  2. targetLanguage
  3. provider
  4. optional trimRange
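
As an illustration of validating these fields on the server, here is a minimal sketch that assumes the text fields arrive as strings, as they would from a multipart parser. The trimRange shape (startSeconds and endSeconds) is not defined by this design and is invented for the example, as are the function and type names. The video file itself is assumed to be handled by the upload parser and is left out.

```typescript
type LlmProviderName = 'doubao' | 'gemini';

// Assumed shape: the design names trimRange but does not define it.
interface TrimRange {
  startSeconds: number;
  endSeconds: number;
}

interface SubtitleRequestFields {
  targetLanguage: string;
  provider?: LlmProviderName; // undefined means "use the server default"
  trimRange?: TrimRange;
}

type FieldsResult =
  | { ok: true; fields: SubtitleRequestFields }
  | { ok: false; status: 400; message: string };

function fail(message: string): FieldsResult {
  return { ok: false, status: 400, message };
}

function isLlmProviderName(value: string): value is LlmProviderName {
  return value === 'doubao' || value === 'gemini';
}

// Multipart text fields are strings, so trimRange is expected as JSON text.
function parseTrimRange(raw: string): TrimRange | undefined {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== 'object' || parsed === null) return undefined;
    const { startSeconds, endSeconds } = parsed as Record<string, unknown>;
    if (typeof startSeconds !== 'number' || typeof endSeconds !== 'number') return undefined;
    if (!(startSeconds >= 0 && endSeconds > startSeconds)) return undefined;
    return { startSeconds, endSeconds };
  } catch (_err) {
    return undefined;
  }
}

function parseSubtitleRequestFields(raw: Record<string, string | undefined>): FieldsResult {
  const targetLanguage = (raw['targetLanguage'] ?? '').trim();
  if (targetLanguage === '') return fail('targetLanguage is required');

  const fields: SubtitleRequestFields = { targetLanguage };

  const provider = raw['provider'];
  if (provider !== undefined) {
    if (!isLlmProviderName(provider)) return fail(`Unsupported provider: ${provider}`);
    fields.provider = provider;
  }

  const rawTrim = raw['trimRange'];
  if (rawTrim !== undefined) {
    const trimRange = parseTrimRange(rawTrim);
    if (trimRange === undefined) return fail('trimRange must be JSON with startSeconds < endSeconds');
    fields.trimRange = trimRange;
  }

  return { ok: true, fields };
}
```

Returning a result object instead of throwing keeps the 400 path explicit at the call site.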

Response

Return the same normalized subtitle structure the editor already understands.

At minimum each subtitle object should include:

  1. id
  2. startTime
  3. endTime
  4. originalText
  5. translatedText
  6. speaker
  7. voiceId
  8. volume

If richer timeline metadata already exists in the current server subtitle pipeline, keep it in the response rather than trimming it away.
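
One way to enforce the minimum field set without stripping richer metadata is to check for presence only and pass each object through unchanged. The sketch below does that. The helper names are hypothetical, and field types are deliberately not checked because this design lists field names only.

```typescript
// The eight fields this design requires on every subtitle object.
const REQUIRED_SUBTITLE_FIELDS = [
  'id',
  'startTime',
  'endTime',
  'originalText',
  'translatedText',
  'speaker',
  'voiceId',
  'volume',
] as const;

// Returns the required field names that are absent (undefined or null).
// Extra keys are ignored, so richer timeline metadata is never rejected.
function missingSubtitleFields(candidate: Record<string, unknown>): string[] {
  return REQUIRED_SUBTITLE_FIELDS.filter(
    (field) => candidate[field] === undefined || candidate[field] === null,
  );
}

// Reports every incomplete item in a payload, with its index.
function findIncompleteSubtitles(
  items: Array<Record<string, unknown>>,
): Array<{ index: number; missing: string[] }> {
  const problems: Array<{ index: number; missing: string[] }> = [];
  items.forEach((item, index) => {
    const missing = missingSubtitleFields(item);
    if (missing.length > 0) problems.push({ index, missing });
  });
  return problems;
}
```

A check like this fits naturally at the end of each provider adapter, so both providers are held to the same output contract.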

Subtitle Generation Strategy

The provider switch should affect only the LLM step. It must not touch TTS, which stays on the MiniMax path.

The cleanest boundary is:

  1. audio extraction and timeline preparation stay on the server
  2. LLM provider handles translation and label generation
  3. MiniMax remains the only TTS engine

This reduces the risk that switching providers changes subtitle timing behavior unpredictably.

Doubao Integration Notes

Use the Ark Responses API on the server:

  1. endpoint: https://ark.cn-beijing.volces.com/api/v3/responses
  2. auth: Authorization: Bearer ${ARK_API_KEY}
  3. model: configurable, defaulting to doubao-seed-2-0-pro-260215

The provider should treat Doubao as a text-generation backend and extract normalized text from the response payload before JSON parsing.

Implementation detail:

  1. the response parser should not assume SDK-specific helpers
  2. it should read the returned response envelope and collect the textual output fragments
  3. the final result should be parsed as JSON only after the output text is reconstructed

This is an implementation inference based on the official Ark Responses API response shape and is meant to keep the parser resilient to wrapper differences.
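
The parsing steps above can be sketched as follows. Two things here are assumptions rather than confirmed API facts: the request body shape ({ model, input }) and the response envelope layout (an output array whose items carry a content array of parts with a text string). Both should be checked against the official Ark documentation before use. The function names are hypothetical.

```typescript
const ARK_RESPONSES_URL = 'https://ark.cn-beijing.volces.com/api/v3/responses';
const DEFAULT_DOUBAO_MODEL = 'doubao-seed-2-0-pro-260215';

interface ArkRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Assumption: the endpoint accepts a JSON body of the form { model, input }.
function buildArkRequest(
  prompt: string,
  apiKey: string,
  model: string = DEFAULT_DOUBAO_MODEL,
): ArkRequest {
  return {
    url: ARK_RESPONSES_URL,
    headers: {
      Authorization: `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ model, input: prompt }),
  };
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null;
}

// Assumption: text lives under output[].content[].text. Anything that does
// not match that layout is skipped instead of causing a throw.
function collectOutputText(envelope: unknown): string {
  if (!isRecord(envelope)) return '';
  const output = envelope['output'];
  if (!Array.isArray(output)) return '';
  const fragments: string[] = [];
  for (const item of output) {
    if (!isRecord(item)) continue;
    const content = item['content'];
    if (!Array.isArray(content)) continue;
    for (const part of content) {
      if (!isRecord(part)) continue;
      const text = part['text'];
      if (typeof text === 'string') fragments.push(text);
    }
  }
  return fragments.join('');
}

// JSON parsing happens only after the full output text is reconstructed.
function parseDoubaoJson(envelope: unknown): unknown {
  const text = collectOutputText(envelope).trim();
  if (text === '') throw new Error('Doubao response contained no output text');
  try {
    return JSON.parse(text);
  } catch (_err) {
    throw new Error('Doubao output text was not valid JSON');
  }
}
```

The collector is intentionally lenient about item types. If the real envelope turns out to mix non-answer text into content parts, filtering on the item or part type would be the place to tighten it.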

Configuration

Environment variables:

  1. ARK_API_KEY for Doubao
  2. GEMINI_API_KEY for Gemini
  3. MINIMAX_API_KEY for TTS
  4. optional DOUBAO_MODEL for server-side model override
  5. optional DEFAULT_LLM_PROVIDER with a default value of doubao

Rules:

  1. No API keys may be embedded in frontend code.
  2. No provider may silently reuse another provider's key.
  3. If the selected provider is missing its key, return a clear error.
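
A minimal sketch of these rules, assuming the environment is passed in as a plain object (for example process.env on the server). The loadLlmCredentials name and the LlmCredentials shape are hypothetical. Each provider looks up only its own variable, so a missing key can never be papered over with another provider's key.

```typescript
type LlmProviderName = 'doubao' | 'gemini';
type Env = Record<string, string | undefined>;

// Each provider maps to exactly one key variable; there is no cross-lookup.
const LLM_KEY_VARIABLE: Record<LlmProviderName, string> = {
  doubao: 'ARK_API_KEY',
  gemini: 'GEMINI_API_KEY',
};

const DEFAULT_DOUBAO_MODEL = 'doubao-seed-2-0-pro-260215';

interface LlmCredentials {
  provider: LlmProviderName;
  apiKey: string;
  model?: string; // set for Doubao only in this sketch
}

function loadLlmCredentials(provider: LlmProviderName, env: Env): LlmCredentials {
  const variable = LLM_KEY_VARIABLE[provider];
  const apiKey = (env[variable] ?? '').trim();
  if (apiKey === '') {
    // The message names both the provider and the variable to set.
    throw new Error(`${variable} is not set; it is required for the ${provider} provider`);
  }
  if (provider === 'doubao') {
    const override = (env['DOUBAO_MODEL'] ?? '').trim();
    return { provider, apiKey, model: override === '' ? DEFAULT_DOUBAO_MODEL : override };
  }
  return { provider, apiKey };
}
```

MINIMAX_API_KEY is left out on purpose: TTS configuration stays separate from LLM provider selection.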

Error Handling

Provider failures must be explicit.

  1. If provider is invalid, return 400.
  2. If the selected provider key is missing, return 400.
  3. If the selected provider returns an auth failure, return 401 or a mapped upstream auth error.
  4. If the selected provider fails unexpectedly, return 502 or 500 with a provider-specific error message.
  5. Do not auto-fallback from Doubao to Gemini or from Gemini to Doubao.

The UI should show which provider failed so the user is never misled about which model generated a subtitle result.
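
The status mapping above can be captured in one function, sketched below. The ProviderFailure union and the response body shape are hypothetical. Where the rules allow a choice, this sketch picks 401 for auth failures and 502 for unexpected upstream failures.

```typescript
type LlmProviderName = 'doubao' | 'gemini';

// Hypothetical failure kinds, one per rule in the list above.
type ProviderFailure =
  | { kind: 'invalid_provider'; requested: string }
  | { kind: 'missing_key'; provider: LlmProviderName }
  | { kind: 'auth_failed'; provider: LlmProviderName }
  | { kind: 'upstream_failed'; provider: LlmProviderName; detail: string };

interface ErrorResponse {
  status: number;
  body: { error: string; provider?: LlmProviderName };
}

// Maps a failure to an HTTP response. The provider name is echoed back so
// the UI can state which provider failed. There is no fallback branch.
function toErrorResponse(failure: ProviderFailure): ErrorResponse {
  switch (failure.kind) {
    case 'invalid_provider':
      return {
        status: 400,
        body: { error: `Unsupported LLM provider: ${failure.requested}` },
      };
    case 'missing_key':
      return {
        status: 400,
        body: { error: `API key for ${failure.provider} is not configured`, provider: failure.provider },
      };
    case 'auth_failed':
      return {
        status: 401,
        body: { error: `${failure.provider} rejected the configured API key`, provider: failure.provider },
      };
    case 'upstream_failed':
      return {
        status: 502,
        body: { error: `${failure.provider} request failed: ${failure.detail}`, provider: failure.provider },
      };
  }
}
```

Keeping the mapping in a single exhaustive switch means a new failure kind will not compile until it is given a status.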

Frontend UX

Add the selector in the editor near the subtitle-generation controls so the choice is visible at generation time.

Rules:

  1. Default selection is Doubao.
  2. The selector affects each generation request immediately.
  3. The selector does not affect previously generated subtitles until the user regenerates.
  4. The selector does not affect MiniMax TTS generation.

Testing Strategy

Coverage should focus on deterministic seams.

  1. Provider resolution defaults to Doubao.
  2. Invalid provider is rejected.
  3. Missing ARK_API_KEY or GEMINI_API_KEY returns clear errors.
  4. Doubao response parsing turns Ark response content into normalized subtitle JSON.
  5. Gemini and Doubao providers both satisfy the same interface contract.
  6. Editor defaults to Doubao and sends the selected provider on regenerate.
  7. TTS behavior remains unchanged when the LLM provider changes.

Rollout Notes

  1. Introduce the new endpoint and provider abstraction first.
  2. Switch the editor to the new endpoint second.
  3. Keep MiniMax TTS untouched except for regression checks.
  4. Leave any deeper visual fallback provider work for a later pass if needed.

Constraints

  1. This workspace is not a Git repository, so the design document cannot be committed here.
  2. The user provided an Ark key in chat, but the implementation must still read provider secrets from .env and not hardcode them into source files.