Asynchronous Requests (Queue API)

Submit long-running API requests to a queue for asynchronous processing. The Queue API mirrors the standard endpoints but processes requests in the background, making it ideal for batch workloads and large inference jobs.

What you’ll learn:

  • How to submit requests to the Queue API
  • When to use async vs. synchronous endpoints
  • How to poll for and retrieve results

Use the /queue endpoints when:

  • Long-running inference — Large models or complex prompts that may exceed synchronous timeout limits
  • Batch processing — Submitting many requests without waiting for each to complete
  • Background jobs — Fire-and-forget workloads where you process results later
  • Rate limit management — Spreading load across time instead of bursting

For interactive or real-time use cases (chatbots, streaming), use the standard /v2 endpoints instead.

Every standard endpoint has a /queue equivalent:

| Standard Endpoint | Queue Endpoint | Description |
| --- | --- | --- |
| POST /v2/chat/completions | POST /queue/chat/completions | Chat completion |
| POST /v2/completions | POST /queue/completions | Text completion |
| POST /v2/embeddings | POST /queue/embeddings | Embeddings |
| POST /v2/audio/transcriptions | POST /queue/audio/transcriptions | Audio transcription |
| POST /v2/audio/translations | POST /queue/audio/translations | Audio translation |
| POST /v2/images/generations | POST /queue/images/generations | Image generation |
| POST /v2/images/edits | POST /queue/images/edits | Image editing |
| GET /v2/models | GET /queue/models | List models |

The request body for each queue endpoint is identical to its standard counterpart — just change the base path from /v2 to /queue.
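To illustrate, the switch is a one-segment path rewrite. `to_queue_url` below is a hypothetical helper for this guide, not part of any SDK:

```python
def to_queue_url(standard_url: str) -> str:
    """Map a standard /v2 endpoint URL to its /queue equivalent."""
    # Replace only the first occurrence so nothing later in the URL is touched.
    return standard_url.replace("/v2/", "/queue/", 1)

print(to_queue_url(
    "https://llm-server.llmhub.t-systems.net/v2/chat/completions"
))
# → https://llm-server.llmhub.t-systems.net/queue/chat/completions
```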

Submit a chat completion to the queue:

```sh
curl -X POST "https://llm-server.llmhub.t-systems.net/queue/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.3-70B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Write a detailed analysis of renewable energy trends in Europe."}
    ],
    "max_tokens": 2000
  }'
```

Process multiple prompts efficiently using the queue:

```python
import asyncio

from openai import AsyncOpenAI

# Reads the API key from the OPENAI_API_KEY environment variable.
client = AsyncOpenAI(
    base_url="https://llm-server.llmhub.t-systems.net/queue",
)

prompts = [
    "Summarize the benefits of solar energy.",
    "Explain how wind turbines generate electricity.",
    "Describe the future of hydrogen fuel cells.",
]

async def process_prompt(prompt):
    response = await client.chat.completions.create(
        model="Llama-3.3-70B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
    )
    return response.choices[0].message.content

async def main():
    # Submit all prompts concurrently and wait for every result.
    tasks = [process_prompt(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    for prompt, result in zip(prompts, results):
        print(f"Prompt: {prompt[:50]}...")
        print(f"Result: {result[:100]}...\n")

asyncio.run(main())
```
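For large batches you may still want to cap the number of in-flight requests (the rate-limit point above). A minimal sketch of bounded concurrency with `asyncio.Semaphore`; `fake_queue_call` is a stand-in for the queue request, not a real API:

```python
import asyncio

async def run_bounded(args, worker, limit=5):
    """Run worker(arg) for each arg, with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(arg):
        async with sem:  # blocks while `limit` workers are already active
            return await worker(arg)

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(guarded(a) for a in args))

# Stand-in worker; in practice this would be the queue request shown above.
async def fake_queue_call(prompt):
    await asyncio.sleep(0.01)
    return f"result for: {prompt}"

results = asyncio.run(run_bounded(["a", "b", "c"], fake_queue_call, limit=2))
print(results)  # ['result for: a', 'result for: b', 'result for: c']
```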

Submit an embeddings request through the queue in the same way:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://llm-server.llmhub.t-systems.net/queue",
)

response = client.embeddings.create(
    model="text-embedding-bge-m3",
    input="The benefits of renewable energy in Europe",
)
print(f"Embedding dimension: {len(response.data[0].embedding)}")
```
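Embedding vectors returned this way are typically compared with cosine similarity. A minimal pure-Python sketch (not part of the SDK; real code would use the vectors from `response.data`):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```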

Queue an audio transcription:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://llm-server.llmhub.t-systems.net/queue",
)

with open("meeting_recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
    )
print(transcript.text)
```

Queue an image generation request:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://llm-server.llmhub.t-systems.net/queue",
)

result = client.images.generate(
    model="gpt-image-1",
    prompt="A futuristic data center powered by renewable energy",
)
# The image is returned base64-encoded in b64_json.
print(result.data[0].b64_json[:50] + "...")
```
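The `b64_json` field holds base64-encoded image bytes, so writing the image to disk needs only the standard library. A sketch with dummy data; `save_b64_image` is a hypothetical helper for this guide:

```python
import base64

def save_b64_image(b64_data: str, path: str) -> int:
    """Decode a base64 payload, write it to `path`, return the byte count."""
    raw = base64.b64decode(b64_data)
    with open(path, "wb") as f:
        f.write(raw)
    return len(raw)

# In practice: save_b64_image(result.data[0].b64_json, "image.png")
demo = base64.b64encode(b"fake-image-bytes").decode()
print(save_b64_image(demo, "demo.bin"))  # 16
```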