Streaming
Stream responses token-by-token using Server-Sent Events (SSE) for faster time-to-first-token and a better user experience in interactive applications.
What you’ll learn:
- How to enable streaming for chat completions
- How to process streaming chunks in different languages
- How to handle streaming with function calling
- Error handling patterns for streams
Why Streaming?
Without streaming, the API waits until the entire response is generated before sending it back. With streaming:
- Faster perceived response — The first token arrives in milliseconds instead of waiting seconds for the full response
- Better UX — Users see text appear progressively, like a human typing
- Lower memory usage — Process tokens as they arrive instead of buffering the full response
- Early termination — Stop generation mid-stream if the output isn’t what you need
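The early-termination pattern can be sketched without a live API by standing in a plain generator for the stream; the tokens and the stop condition here are illustrative assumptions, not part of the API:

```python
def fake_stream():
    """Stand-in generator for a token stream (illustrative only)."""
    for token in ["Hello", " world", "!", " More", " text", " we", " don't", " need"]:
        yield token

collected = []
for token in fake_stream():
    collected.append(token)
    if "".join(collected).endswith("!"):  # illustrative stop condition
        break  # stop consuming here; with a real SDK stream, also close it

print("".join(collected))
```

With a real SDK stream, breaking out of the loop stops consuming tokens; closing the stream (or letting it go out of scope) releases the underlying connection so the server can stop generating.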
Basic Streaming
Enable streaming by setting `stream: true` in your request:
```shell
curl -X POST "$OPENAI_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "stream": true
  }'
```

The response is a stream of `data:` lines in SSE format:

```
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"delta":{"role":"assistant"},"index":0}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"delta":{"content":"Quantum"},"index":0}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"delta":{"content":" computing"},"index":0}]}

...

data: [DONE]
```

```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

print()  # newline at end
```

```javascript
import OpenAI from "openai";

const client = new OpenAI();

const stream = await client.chat.completions.create({
  model: "Llama-3.3-70B-Instruct",
  messages: [{ role: "user", content: "Explain quantum computing" }],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}

console.log(); // newline at end
```
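If you are not using an SDK, the SSE wire format shown above can be parsed by hand. A minimal sketch, assuming well-formed single-line `data:` payloads (the full SSE spec also allows comments, multi-line data fields, and blank-line event separators):

```python
import json

def parse_sse_content(lines):
    """Extract content deltas from `data:` lines, stopping at [DONE]."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":  # sentinel marking end of stream
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)

raw = [
    'data: {"choices":[{"delta":{"role":"assistant"},"index":0}]}',
    'data: {"choices":[{"delta":{"content":"Quantum"},"index":0}]}',
    'data: {"choices":[{"delta":{"content":" computing"},"index":0}]}',
    "data: [DONE]",
]
print(parse_sse_content(raw))  # Quantum computing
```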
Collecting the Full Response
If you need both streaming output and the complete text:
```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "List 5 EU countries"}],
    stream=True,
)

full_response = []
for chunk in stream:
    if not chunk.choices:
        continue
    content = chunk.choices[0].delta.content
    if content:
        full_response.append(content)
        print(content, end="", flush=True)

complete_text = "".join(full_response)
print(f"\n\nTotal length: {len(complete_text)} characters")
```

```javascript
import OpenAI from "openai";

const client = new OpenAI();

const stream = await client.chat.completions.create({
  model: "Llama-3.3-70B-Instruct",
  messages: [{ role: "user", content: "List 5 EU countries" }],
  stream: true,
});

const chunks = [];
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    chunks.push(content);
    process.stdout.write(content);
  }
}

const completeText = chunks.join("");
console.log(`\n\nTotal length: ${completeText.length} characters`);
```
Streaming with Function Calling
When streaming with tools/functions, tool call arguments arrive incrementally across chunks:
```python
from openai import OpenAI
import json

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                },
                "required": ["location"],
            },
        },
    }
]

stream = client.chat.completions.create(
    model="Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
    stream=True,
)

tool_calls = {}
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta

    if delta.tool_calls:
        for tc in delta.tool_calls:
            if tc.id:  # first fragment of a tool call carries its id and name
                tool_calls[tc.index] = {
                    "id": tc.id,
                    "function": {"name": tc.function.name, "arguments": ""},
                }
            if tc.function.arguments:
                tool_calls[tc.index]["function"]["arguments"] += tc.function.arguments

# Process completed tool calls
for tc in tool_calls.values():
    args = json.loads(tc["function"]["arguments"])
    print(f"Tool: {tc['function']['name']}, Args: {args}")
```
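The accumulation step can be exercised offline. A sketch with hand-made delta dicts standing in for SDK objects (the field names mirror the chunk schema above, but the deltas themselves are made up for illustration):

```python
import json

def accumulate_tool_calls(deltas):
    """Merge incremental tool-call deltas into complete calls, keyed by index."""
    calls = {}
    for d in deltas:
        idx = d["index"]
        if d.get("id"):  # first fragment carries the id and function name
            calls[idx] = {"id": d["id"], "name": d["function"]["name"], "arguments": ""}
        if d["function"].get("arguments"):
            calls[idx]["arguments"] += d["function"]["arguments"]
    return calls

# Arguments arrive as JSON fragments that only parse once concatenated
deltas = [
    {"index": 0, "id": "call_1", "function": {"name": "get_weather", "arguments": ""}},
    {"index": 0, "function": {"arguments": '{"loc'}},
    {"index": 0, "function": {"arguments": 'ation": "Berlin"}'}},
]
calls = accumulate_tool_calls(deltas)
args = json.loads(calls[0]["arguments"])
print(calls[0]["name"], args)  # get_weather {'location': 'Berlin'}
```

The key point: never call `json.loads` on an individual fragment; only the concatenation of all argument deltas for one index is valid JSON.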
Error Handling
Handle connection drops and errors gracefully during streaming:
```python
from openai import OpenAI, APIError, APIConnectionError

client = OpenAI()

def stream_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            stream = client.chat.completions.create(
                model="Llama-3.3-70B-Instruct",
                messages=messages,
                stream=True,
            )
            full_response = []
            for chunk in stream:
                if not chunk.choices:
                    continue
                content = chunk.choices[0].delta.content
                if content:
                    full_response.append(content)
                    print(content, end="", flush=True)
            print()
            return "".join(full_response)
        except APIConnectionError:
            print(f"\nConnection lost. Retry {attempt + 1}/{max_retries}...")
        except APIError as e:
            print(f"\nAPI error: {e}. Retry {attempt + 1}/{max_retries}...")

    raise Exception("Max retries exceeded")
```
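One caveat with a retry loop like this: it restarts generation from scratch, so any partial output already shown is generated and printed again. A possible refinement, sketched here as an assumption rather than an API feature, is to feed the partial text back as an assistant turn and ask the model to continue:

```python
def build_retry_messages(messages, partial_text):
    """Append partial assistant output so a retry can continue instead of restarting."""
    if not partial_text:
        return list(messages)
    return list(messages) + [
        {"role": "assistant", "content": partial_text},
        {"role": "user", "content": "Continue exactly where you left off."},
    ]

msgs = [{"role": "user", "content": "Explain quantum computing"}]
retry_msgs = build_retry_messages(msgs, "Quantum computing uses qubits")
print(len(retry_msgs))  # 3
```

Whether the model actually continues seamlessly depends on the model; for short responses, a plain restart is often simpler.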
Stream Options
Request usage statistics in the final chunk with `stream_options`:
```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.usage:
        print(f"\nTokens — prompt: {chunk.usage.prompt_tokens}, "
              f"completion: {chunk.usage.completion_tokens}, "
              f"total: {chunk.usage.total_tokens}")
    elif chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
Next Steps
- Chat Completions — Non-streaming chat API usage
- Function Calling — Define and use tools with the API
- Asynchronous Requests — Queue-based processing for batch workloads
- Error Codes — Handle API errors