image-to-image and image-to-video support #618

Thank you, @yiss, for describing this so well.

## Problem

`generateImage()` and `generateVideo()` are currently built around text-prompt inputs, but several providers and models also support image-conditioned generation.

Examples include:

- image-to-image generation
- prompt + reference image generation
- multi-reference image generation
- image-to-video generation
- video generation from a starting frame
- model-specific image editing / transformation workflows

Today there is no obvious provider-agnostic way to pass image inputs into `generateImage()` and `generateVideo()`.

TanStack AI already has a clean multimodal abstraction for content parts (`ImagePart` with `source.type: 'data' | 'url'`). It would be great if media generation APIs reused that same shape instead of introducing provider-specific one-offs for image-conditioned generation.

## Why this matters

Modern image and video models are increasingly multimodal. Generation is no longer only text-to-image or text-to-video.

A unified way to pass image inputs would make it much easier for adapters to support workflows like:

- image editing
- reference-guided generation
- image-to-video
- multi-image composition

## Proposal

Add an optional `inputs` field to both `generateImage()` and `generateVideo()` that accepts reusable multimodal content parts, ideally existing `ImagePart` values.

This would provide a consistent, provider-agnostic way to pass image-conditioned inputs into media generation APIs.
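To make the proposed shape concrete, here is a minimal type sketch. It assumes local stand-in types that mirror the `ImagePart` shape shown in this issue; `MediaGenerationOptions` and `resolveInputs` are hypothetical names used only for illustration, not existing library exports.

```typescript
// Local stand-ins that mirror the `ImagePart` shape described in this issue.
// They are assumptions for illustration, not the library's real definitions.
type ImageSource =
  | { type: 'url'; value: string }
  | { type: 'data'; value: string; mimeType: string }

interface ImagePart {
  type: 'image'
  source: ImageSource
}

// Hypothetical option shape: the existing prompt plus the proposed optional field.
interface MediaGenerationOptions {
  prompt: string
  inputs?: ImagePart[]
}

// Hypothetical helper: a missing `inputs` field is treated as "no image inputs",
// so text-only calls keep their current meaning.
function resolveInputs(options: MediaGenerationOptions): ImagePart[] {
  return options.inputs ?? []
}

const textOnly: MediaGenerationOptions = { prompt: 'A red bicycle' }

const conditioned: MediaGenerationOptions = {
  prompt: 'Turn this into a cinematic product photo',
  inputs: [
    {
      type: 'image',
      source: { type: 'url', value: 'https://example.com/reference.png' },
    },
  ],
}
```

Because the field is optional, existing text-only call sites would type-check unchanged under this sketch.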

## Example API

### `generateImage()`

```ts
import { generateImage, type ImagePart } from '@tanstack/ai'
import { openaiImage } from '@tanstack/ai-openai'

const reference: ImagePart = {
  type: 'image',
  source: {
    type: 'url',
    value: 'https://example.com/reference.png',
  },
}

await generateImage({
  adapter: openaiImage('gpt-image-1.5'),
  prompt: 'Turn this into a cinematic product photo',
  inputs: [reference],
})
```

### `generateVideo()`

```ts
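import { generateVideo, type ImagePart } from '@tanstack/ai'

// `base64Image` is assumed to hold base64-encoded PNG data loaded elsewhere.
// `googleVideo` stands for a video adapter factory; its import is not shown.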

const startingFrame: ImagePart = {
  type: 'image',
  source: {
    type: 'data',
    value: base64Image,
    mimeType: 'image/png',
  },
}

await generateVideo({
  adapter: googleVideo('veo-3.1'),
  prompt: 'Animate this still into a slow cinematic push-in with subtle motion',
  inputs: [startingFrame],
})
```

### Multiple reference images

```ts
import { generateImage, type ImagePart } from '@tanstack/ai'
import { geminiImage } from '@tanstack/ai-gemini'

const product: ImagePart = {
  type: 'image',
  source: {
    type: 'url',
    value: 'https://example.com/product.png',
  },
}

const style: ImagePart = {
  type: 'image',
  source: {
    type: 'url',
    value: 'https://example.com/style.png',
  },
}

await generateImage({
  adapter: geminiImage('nano-banana'),
  prompt: 'Generate a new image of the product using the style of the second reference',
  inputs: [product, style],
})
```

## Expected behavior

- `generateImage()` and `generateVideo()` should both accept image-conditioned inputs through the same field name.
- The input format should ideally reuse existing TanStack AI multimodal primitives such as `ImagePart`.
- Adapters should map those inputs into the provider-native request shape.
- Adapters can reject unsupported combinations, either at runtime or through adapter-specific validation.
- Providers that only support text prompts should continue to work unchanged.
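The adapter-side behavior listed above could look roughly like the following sketch. Everything here is hypothetical: `ProviderRequest`, `AdapterCapabilities`, and `toProviderRequest` are made-up names, and the provider-native field names (`url`, `base64`, `mime_type`) are placeholders, since real providers each define their own request shapes.

```typescript
// Local stand-ins mirroring the `ImagePart` shape from this issue (assumption).
type ImageSource =
  | { type: 'url'; value: string }
  | { type: 'data'; value: string; mimeType: string }

interface ImagePart {
  type: 'image'
  source: ImageSource
}

// Hypothetical provider-native shapes; real provider APIs differ.
type ProviderImage =
  | { url: string }
  | { base64: string; mime_type: string }

interface ProviderRequest {
  prompt: string
  images: ProviderImage[]
}

// Hypothetical per-adapter capability declaration.
interface AdapterCapabilities {
  maxInputImages: number
}

// Map one provider-agnostic part to the placeholder provider shape.
function toProviderImage(part: ImagePart): ProviderImage {
  const { source } = part
  return source.type === 'url'
    ? { url: source.value }
    : { base64: source.value, mime_type: source.mimeType }
}

// Build the request, rejecting more inputs than the adapter supports.
function toProviderRequest(
  options: { prompt: string; inputs?: ImagePart[] },
  caps: AdapterCapabilities,
): ProviderRequest {
  const inputs = options.inputs ?? []
  if (inputs.length > caps.maxInputImages) {
    throw new Error(
      `Adapter accepts at most ${caps.maxInputImages} input image(s), received ${inputs.length}`,
    )
  }
  return { prompt: options.prompt, images: inputs.map(toProviderImage) }
}
```

In this sketch a text-only adapter would declare `maxInputImages: 0`, so text-only calls pass through unchanged while any image input is rejected with a clear error.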

## Open design questions

- Should the field be named `inputs`, `references`, or something else?
- Should it accept only `ImagePart[]`, or broader content parts for future extensibility?
- Should `generateVideo()` support multiple input images as well, or only one initially?
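On the second question, one possible direction is sketched below: if the field were later widened from `ImagePart[]` to a union of content parts, an image-only adapter could narrow the list and reject anything else. `VideoPart` here is a local stand-in, and `MediaInputPart` and `requireImageParts` are hypothetical names, not part of this proposal.

```typescript
// Local stand-ins for illustration only (assumptions, not library types).
type MediaSource =
  | { type: 'url'; value: string }
  | { type: 'data'; value: string; mimeType: string }

interface ImagePart {
  type: 'image'
  source: MediaSource
}

// Stand-in for a part type the field might accept in the future.
interface VideoPart {
  type: 'video'
  source: MediaSource
}

// Hypothetical widened element type for the `inputs` field.
type MediaInputPart = ImagePart | VideoPart

// Narrow a widened input list to images, rejecting unsupported part types
// instead of silently dropping them.
function requireImageParts(inputs: MediaInputPart[]): ImagePart[] {
  const images: ImagePart[] = []
  for (const part of inputs) {
    if (part.type !== 'image') {
      throw new Error(`Unsupported input part type: ${part.type}`)
    }
    images.push(part)
  }
  return images
}
```

Since `ImagePart[]` is assignable to `MediaInputPart[]`, call sites written against the narrower type would keep compiling if the field were widened this way.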

## Summary

Request: add a unified, provider-agnostic way to pass image-conditioned inputs into both `generateImage()` and `generateVideo()`, ideally by reusing existing multimodal content-part types such as `ImagePart`.

_Originally posted by @yiss in https://github.com/TanStack/ai/discussions/481_
