Thank you @yiss for describing this so well
Problem
generateImage() and generateVideo() are currently centered around text-prompt inputs, but several providers and models support image-conditioned generation workflows.
Examples include:
- image-to-image generation
- prompt + reference image generation
- multi-reference image generation
- image-to-video generation
- video generation from a starting frame
- model-specific image editing / transformation workflows
Today there is no obvious provider-agnostic way to pass image inputs into generateImage() and generateVideo().
TanStack AI already has a clean multimodal abstraction for content parts (ImagePart with source.type: 'data' | 'url'). It would be great if media generation APIs reused that same shape instead of introducing provider-specific one-offs for image-conditioned generation.
Why this matters
Modern image and video models are increasingly multimodal. Generation is no longer only text-to-image or text-to-video.
A unified way to pass image inputs would make it much easier for adapters to support workflows like:
- image editing
- reference-guided generation
- image-to-video
- multi-image composition
Proposal
Add an optional inputs field to both generateImage() and generateVideo() that accepts reusable multimodal content parts, ideally existing ImagePart values.
This would provide a consistent, provider-agnostic way to pass image-conditioned inputs into media generation APIs.
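To make the proposal concrete, here is a minimal sketch of what the options shape could look like. The ImagePart type below is a local stand-in modeled on the shape described in this issue (the real one is exported by @tanstack/ai and may differ), and MediaGenerationOptions and resolveInputs are hypothetical names used only for illustration.

```typescript
// Local stand-in for the ImagePart shape described in this issue.
// Assumption: the real type exported by @tanstack/ai may differ in detail.
type ImageSource =
  | { type: 'url'; value: string }
  | { type: 'data'; value: string; mimeType: string }

interface ImagePart {
  type: 'image'
  source: ImageSource
}

// Hypothetical options shape: `inputs` is optional, so existing
// text-only calls remain valid without changes.
interface MediaGenerationOptions {
  prompt: string
  inputs?: ImagePart[]
}

// Hypothetical helper: normalizes the optional field so that adapter code
// can always work with an array.
function resolveInputs(options: MediaGenerationOptions): ImagePart[] {
  return options.inputs ?? []
}
```

Because the field is optional, a call that passes only a prompt resolves to an empty input list and behaves as it does today.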
Example API
generateImage()
import { generateImage, type ImagePart } from '@tanstack/ai'
const reference: ImagePart = {
  type: 'image',
  source: {
    type: 'url',
    value: 'https://example.com/reference.png',
  },
}
await generateImage({
  adapter: openaiImage('gpt-image-1.5'),
  prompt: 'Turn this into a cinematic product photo',
  inputs: [reference],
})
generateVideo()
import { generateVideo, type ImagePart } from '@tanstack/ai'
// base64Image is assumed to hold base64-encoded PNG data, defined elsewhere
const startingFrame: ImagePart = {
  type: 'image',
  source: {
    type: 'data',
    value: base64Image,
    mimeType: 'image/png',
  },
}
await generateVideo({
  adapter: googleVideo('veo-3.1'),
  prompt: 'Animate this still into a slow cinematic push-in with subtle motion',
  inputs: [startingFrame],
})
Multiple reference images
import { generateImage, type ImagePart } from '@tanstack/ai'
const product: ImagePart = {
  type: 'image',
  source: {
    type: 'url',
    value: 'https://example.com/product.png',
  },
}
const style: ImagePart = {
  type: 'image',
  source: {
    type: 'url',
    value: 'https://example.com/style.png',
  },
}
await generateImage({
  adapter: geminiImage('nano-banana'),
  prompt: 'Generate a new image of the product using the style of the second reference',
  inputs: [product, style],
})
Expected behavior
- generateImage() and generateVideo() should both accept image-conditioned inputs through the same field name.
- The input format should ideally reuse existing TanStack AI multimodal primitives such as ImagePart.
- Adapters should map those inputs into the provider-native request shape.
- Unsupported combinations can be rejected by adapters at runtime or by adapter-specific validation.
- Providers that only support text prompts should continue to work unchanged.
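The mapping and validation behavior described above could be sketched roughly as follows. Everything here is an assumption made for illustration: NativeImage is a made-up provider request shape, AdapterCapabilities and toNativeImages are hypothetical names, and ImagePart is a local stand-in for the type exported by @tanstack/ai.

```typescript
// Local stand-in for the ImagePart shape described in this issue.
type ImageSource =
  | { type: 'url'; value: string }
  | { type: 'data'; value: string; mimeType: string }

interface ImagePart {
  type: 'image'
  source: ImageSource
}

// Hypothetical provider-native image shape; real providers each differ.
type NativeImage =
  | { image_url: string }
  | { image_base64: string; media_type: string }

// Hypothetical capability descriptor. A text-only adapter would declare 0.
interface AdapterCapabilities {
  maxInputImages: number
}

// Validates the inputs against the adapter's declared limit, then maps each
// ImagePart into the (made-up) provider-native shape.
function toNativeImages(
  inputs: ImagePart[] | undefined,
  caps: AdapterCapabilities,
): NativeImage[] {
  const parts = inputs ?? []
  if (parts.length > caps.maxInputImages) {
    throw new Error(
      `Adapter accepts at most ${caps.maxInputImages} input image(s), received ${parts.length}`,
    )
  }
  return parts.map((part): NativeImage => {
    const source = part.source
    if (source.type === 'url') {
      return { image_url: source.value }
    }
    return { image_base64: source.value, media_type: source.mimeType }
  })
}
```

With this shape, a text-only provider keeps working unchanged when no inputs are passed, and rejects image inputs with a clear runtime error instead of silently ignoring them.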
Open design questions
- Should the field be named inputs, references, or something else?
- Should it accept only ImagePart[], or broader content parts for future extensibility?
- Should generateVideo() support multiple input images as well, or only one initially?
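As one possible illustration of the "broader content parts" option, the field could be typed as a discriminated union that adapters narrow at runtime. This is only a sketch: VideoPart, GenerationInput, and partitionInputs are hypothetical names, and the part shapes are local stand-ins rather than the types exported by @tanstack/ai.

```typescript
// Local stand-ins for content-part shapes; the real types may differ.
type MediaSource =
  | { type: 'url'; value: string }
  | { type: 'data'; value: string; mimeType: string }

interface ImagePart {
  type: 'image'
  source: MediaSource
}

// Hypothetical future part type, included only to show how a union extends.
interface VideoPart {
  type: 'video'
  source: MediaSource
}

// Hypothetical broader input type for the `inputs` field.
type GenerationInput = ImagePart | VideoPart

// Splits inputs into the image parts an adapter can use today and
// everything else, which the adapter can then reject or ignore explicitly.
function partitionInputs(inputs: GenerationInput[]): {
  images: ImagePart[]
  unsupported: GenerationInput[]
} {
  const images: ImagePart[] = []
  const unsupported: GenerationInput[] = []
  for (const input of inputs) {
    if (input.type === 'image') {
      images.push(input)
    } else {
      unsupported.push(input)
    }
  }
  return { images, unsupported }
}
```

Starting with ImagePart[] and widening to a union later would be a non-breaking change for callers, since every ImagePart[] is also a valid GenerationInput[].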
Summary
Request: add a unified, provider-agnostic way to pass image-conditioned inputs into both generateImage() and generateVideo(), ideally by reusing existing multimodal content-part types such as ImagePart.
Originally posted by @yiss in #481