Rate Limits
Rate limits protect the service and ensure fair usage. Limits are defined per plan tier and vary by model.
Limits by Plan Tier
| Metric | Basic | Standard |
|---|---|---|
| Tokens per Minute (TPM) | 18,750 input | Up to 150,000 input |
| Requests per Minute (RPM) | 20 | Up to 600 |
Exact limits vary by model. See the Plans & Pricing page for per-model details, or download the Service Description PDF.
How Rate Limits Work
- TPM (Tokens per Minute) — Maximum number of input tokens processed per minute
- RPM (Requests per Minute) — Maximum number of API requests per minute
- Limits are applied per API key
- Both input and output tokens count toward TPM limits
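To make the interplay of the two limits concrete, here is a small sketch using the Basic-tier numbers from the table above: whichever budget runs out first is the effective throughput limit. The figures are illustrative, not a contract.

```python
# Basic-tier figures from the table above: 18,750 TPM and 20 RPM.
TPM_LIMIT = 18_750
RPM_LIMIT = 20

def max_requests_per_minute(avg_tokens_per_request: int) -> int:
    """Effective per-minute request budget: the tighter of the RPM cap
    and the number of requests the token budget can absorb."""
    token_bound = TPM_LIMIT // avg_tokens_per_request
    return min(RPM_LIMIT, token_bound)

# Small requests are RPM-bound; large requests are TPM-bound.
print(max_requests_per_minute(500))    # -> 20 (RPM is the bottleneck)
print(max_requests_per_minute(2_000))  # -> 9  (tokens are the bottleneck)
```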
Rate Limit Response Headers
Responses from Azure-hosted models (GPT, o-series) include headers that help you track usage against limits:
| Header | Description |
|---|---|
| `x-ratelimit-limit-requests` | Maximum requests allowed per minute |
| `x-ratelimit-limit-tokens` | Maximum tokens allowed per minute |
| `x-ratelimit-remaining-requests` | Requests remaining in the current window |
| `x-ratelimit-remaining-tokens` | Tokens remaining in the current window |
| `x-ratelimit-reset-requests` | Time until the request limit resets |
| `x-ratelimit-reset-tokens` | Time until the token limit resets |
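The reset headers report a duration rather than a count. The exact string format is provider-specific (the example later on this page shows a value like `3s`), so a tolerant parser is useful. The sketch below handles a few common shapes and is an assumption, not a documented contract:

```python
import re

def parse_reset(value: str) -> float:
    """Convert a reset header such as '3s', '250ms', or '1m20s' to seconds.
    The accepted formats are an assumption; adjust to what your responses show."""
    units = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}
    total = 0.0
    for amount, unit in re.findall(r"(\d+(?:\.\d+)?)(ms|s|m|h)", value):
        total += float(amount) * units[unit]
    return total

print(parse_reset("3s"))     # -> 3.0
print(parse_reset("1m20s"))  # -> 80.0
```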
Reading Headers
```python
import os
import httpx

# Use httpx directly to inspect response headers
response = httpx.post(
    "https://llm-server.llmhub.t-systems.net/v2/chat/completions",
    headers={"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"},
    json={
        "model": "gpt-4.1",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 10,
    },
)

print(f"Requests remaining: {response.headers.get('x-ratelimit-remaining-requests')}")
print(f"Tokens remaining: {response.headers.get('x-ratelimit-remaining-tokens')}")
print(f"Resets in: {response.headers.get('x-ratelimit-reset-requests')}")
```

```bash
curl -i -X POST "$OPENAI_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4.1",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 10
  }'

# The -i flag shows response headers including:
# x-ratelimit-remaining-requests: 19
# x-ratelimit-remaining-tokens: 18740
# x-ratelimit-reset-requests: 3s
```

Handling Rate Limits
When you exceed a rate limit, the API returns a 429 Too Many Requests error. Best practices:
- Monitor response headers — Check `x-ratelimit-remaining-*` headers to avoid hitting limits
- Implement exponential backoff — Wait longer between retries
- Batch requests — Combine multiple small requests into fewer larger ones
- Cache responses — Avoid repeating identical requests
- Use the Queue API — For batch workloads, use asynchronous requests to spread load
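The caching suggestion above can be as simple as keying responses by the request payload. This is an illustrative sketch, not an official helper: `client` stands for any OpenAI-compatible client object as used elsewhere on this page, and the cache deliberately never expires, which only suits deterministic or idempotent prompts.

```python
import hashlib
import json

_cache: dict = {}

def cached_completion(client, model, messages):
    # Identical (model, messages) payloads map to the same cache key,
    # so repeated identical requests hit the API only once.
    key = hashlib.sha256(
        json.dumps({"model": model, "messages": messages}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = client.chat.completions.create(model=model, messages=messages)
    return _cache[key]
```

Repeated identical prompts then cost a single request against both the RPM and TPM budgets.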
```python
import time

from openai import OpenAI, RateLimitError

client = OpenAI()

def safe_completion(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="Llama-3.3-70B-Instruct",
                messages=messages,
            )
        except RateLimitError:
            # Back off exponentially: 1s, 2s, 4s, ...
            wait_time = 2 ** attempt
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
    raise Exception("Max retries exceeded")
```

Need Higher Limits?
- Upgrade your plan — Higher tiers have significantly higher TPM and RPM limits
- Dedicated instances — For enterprise workloads, contact us for dedicated GPU resources with custom rate limits
- Contact: T-Cloud Marketplace or reach out to the AIFS team at ai@t-systems.com