# Multimodal (Vision)
AI Foundation Services provides vision models that can analyze images alongside text. Use the same Chat Completions API with image content.
What you’ll learn:
- How to analyze images from URLs
- How to send local images via base64 encoding
- Which models support vision capabilities
## Analyze an Image from URL

```sh
curl -X POST "$OPENAI_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-flash",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {"type": "image_url", "image_url": {"url": "https://images.unsplash.com/photo-1546069901-ba9599a7e63c?w=400"}}
        ]
      }
    ],
    "max_tokens": 1024
  }'
```

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://images.unsplash.com/photo-1546069901-ba9599a7e63c?w=400"
                    },
                },
            ],
        }
    ],
    max_tokens=1024,
)

print(response.choices[0].message.content)
```

```javascript
import OpenAI from "openai";

const client = new OpenAI();

const response = await client.chat.completions.create({
  model: "gemini-2.5-flash",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What's in this image?" },
        {
          type: "image_url",
          image_url: {
            url: "https://images.unsplash.com/photo-1546069901-ba9599a7e63c?w=400",
          },
        },
      ],
    },
  ],
  max_tokens: 1024,
});

console.log(response.choices[0].message.content);
```
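The curl, Python, and JavaScript snippets all send the same message shape: a user message whose `content` is a list of typed parts, one `text` part and one `image_url` part. A small helper (a sketch; the function name is my own, not part of any SDK) makes that shape explicit and reusable:

```python
def image_question(text: str, image_url: str) -> dict:
    """Build a multi-part user message: one text part plus one image_url part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

message = image_question(
    "What's in this image?",
    "https://images.unsplash.com/photo-1546069901-ba9599a7e63c?w=400",
)
```

The resulting dict can be passed inside the `messages` list of `client.chat.completions.create(...)` in place of the inline literal.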
## Analyze a Local Image (Base64)

You can also pass a local image as a base64-encoded string:
```python
import base64

from openai import OpenAI

client = OpenAI()


def encode_image(image_path):
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


base64_image = encode_image("/path/to/your/image.jpg")

response = client.chat.completions.create(
    model="Qwen3-VL-30B-A3B-Instruct-FP8",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
            ],
        }
    ],
    max_tokens=1000,
)

print(response.choices[0].message.content)
```
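The request above hard-codes `image/jpeg` in the data URL. If you send other formats (PNG, WebP), the MIME type should match the file; a small stdlib helper (a sketch, the names are my own) can guess it from the file extension:

```python
import base64
import mimetypes


def to_data_url(image_path: str, data: bytes) -> str:
    """Encode raw image bytes as a data: URL, guessing the MIME type from
    the file extension (falls back to application/octet-stream)."""
    mime, _ = mimetypes.guess_type(image_path)
    mime = mime or "application/octet-stream"
    b64 = base64.b64encode(data).decode("utf-8")
    return f"data:{mime};base64,{b64}"


# In practice you would read the bytes from disk first:
#   with open(image_path, "rb") as f:
#       data = f.read()
url = to_data_url("photo.png", b"\x89PNG")
```

The returned string drops into the `image_url` field exactly like the hard-coded `data:image/jpeg;base64,...` value above.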
## Available Vision Models

| Model | Provider | Capabilities |
|---|---|---|
| Qwen3-VL-30B-A3B-Instruct-FP8 | T-Cloud (Germany) | Image understanding, OCR |
| gemini-2.5-flash | Google Cloud | Image + video understanding |
| gpt-4.1 | Azure | Image understanding |
Check Available Models for the latest list.
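To check programmatically which of these your account can reach, you can list models and intersect the result with the table. This is a sketch: `client.models.list()` is the standard OpenAI SDK call, and the ID set below simply mirrors the table above.

```python
# Vision-capable model IDs, copied from the table above.
VISION_MODEL_IDS = {
    "Qwen3-VL-30B-A3B-Instruct-FP8",
    "gemini-2.5-flash",
    "gpt-4.1",
}


def available_vision_models(model_ids):
    """Return the IDs from `model_ids` that appear in the vision table, sorted."""
    return sorted(m for m in model_ids if m in VISION_MODEL_IDS)


# With the OpenAI SDK (requires credentials and network access):
#   from openai import OpenAI
#   client = OpenAI()
#   ids = [m.id for m in client.models.list()]
#   print(available_vision_models(ids))
```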
## Next Steps

- Visual RAG — Index and retrieve from documents with text + image understanding
- Function Calling — Connect models to external tools
- Streaming — Stream responses for better UX