
Context Compression

POST /v1/chat/completions
POST /v1/messages
POST /v1/responses

Context compression automatically summarizes older conversation history to reduce token usage and cost, while preserving recent context. This is especially useful for long-running conversations where token counts grow significantly.

How It Works

When enabled, Apertis analyzes the conversation before forwarding it to the target model:

  1. Token threshold check — Compression only triggers when the conversation exceeds a configurable token threshold
  2. Message segmentation — Messages are split into system prompts, compressible history, and recent turns (which are always preserved)
  3. Cost-effectiveness check — Compression is skipped if the cost of running the compression model exceeds the estimated savings
  4. Summarization — Older messages are summarized by a lightweight model (e.g., gpt-4.1-mini, claude-haiku-4.5, or gemini-3-flash-preview)
  5. Injection — The summary replaces the older messages, and recent turns are preserved verbatim
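The five steps above can be sketched end to end. Everything below — the word-count tokenizer, the toy summarizer, the simplified cost check, and the turn-counting rule — is an illustrative assumption, not the actual Apertis implementation:

```python
def count_tokens(message):
    # Stand-in tokenizer: one token per whitespace-separated word.
    return len(message["content"].split())

def summarize(messages):
    # Stand-in for the lightweight summarization model call: keeps the
    # first few words of each older message.
    parts = [" ".join(m["content"].split()[:3]) for m in messages]
    return "Summary of earlier conversation: " + "; ".join(parts)

def compress_conversation(messages, threshold=8000, keep_turns=6):
    # 1. Token threshold check
    if sum(count_tokens(m) for m in messages) <= threshold:
        return messages
    # 2. Segment into system prompts, compressible history, recent turns
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]
    recent = history[-keep_turns * 2:]  # one turn = user + assistant message
    older = history[:-keep_turns * 2]
    if not older:
        return messages
    # 3. Cost-effectiveness check (toy version: skip unless the summary
    #    is actually shorter than the history it replaces)
    summary_text = summarize(older)  # 4. Summarization
    if len(summary_text.split()) >= sum(count_tokens(m) for m in older):
        return messages
    # 5. Injection: the summary replaces the older messages
    return system + [{"role": "system", "content": summary_text}] + recent
```

In the real gateway, steps 3 and 4 involve an actual model call and price-aware estimates; the shape of the flow is the same.
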

Transparent to Your Application

Compression happens at the gateway level. Your application sends the full conversation as usual — Apertis handles compression automatically and returns the response as normal. The only visible difference is reduced token usage.

Enabling Compression

There are three ways to enable context compression, with the following priority order:

Request body > Request headers > Token-level defaults

Method 1: Request Body

Add a compression object to your request body. Both the OpenAI and Anthropic Python SDKs pass it through via their extra_body parameter.

OpenAI SDK (Python)

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.apertis.ai/v1"
)

response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
    extra_body={
        "compression": {
            "enabled": True,
            "strategy": "on",    # "on", "conservative", or "aggressive"
            "threshold": 8000,   # Token threshold to trigger compression
            "keep_turns": 6,     # Recent turns to always preserve
            "model": "auto"      # Compression model ("auto" or a specific model)
        }
    }
)

print(response.choices[0].message.content)

Anthropic SDK (Python)

import anthropic

client = anthropic.Anthropic(
    api_key="YOUR_API_KEY",
    base_url="https://api.apertis.ai"
)

message = client.messages.create(
    model="claude-sonnet-4.5",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
    extra_body={
        "compression": {
            "enabled": True,
            "strategy": "on",
            "model": "gpt-4.1-mini"
        }
    }
)

print(message.content[0].text)

cURL

curl https://api.apertis.ai/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4.1-mini",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "compression": {
      "enabled": true,
      "strategy": "on",
      "model": "gpt-4.1-mini"
    }
  }'

OpenAI SDK — Responses API (Python)

Compression also works with the /v1/responses endpoint. The input field (string or array of messages) is automatically converted for compression.

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.apertis.ai/v1"
)

response = client.responses.create(
    model="o4-mini",
    input=[
        {"role": "user", "content": "Explain distributed systems"},
        {"role": "assistant", "content": "Distributed systems are..."},
        # ... long conversation history ...
        {"role": "user", "content": "Summarize the key points"}
    ],
    extra_body={
        "compression": {
            "enabled": True,
            "strategy": "aggressive",
            "model": "gpt-4.1-mini"
        }
    }
)

print(response.output[0].content[0].text)

cURL — Responses API

curl https://api.apertis.ai/v1/responses \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "o4-mini",
    "input": [
      {"role": "user", "content": "Hello!"}
    ],
    "compression": {
      "enabled": true,
      "strategy": "on",
      "model": "gpt-4.1-mini"
    }
  }'

Node.js / TypeScript

const response = await fetch('https://api.apertis.ai/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'gpt-4.1-mini',
    messages: [
      { role: 'user', content: 'Hello!' }
    ],
    compression: {
      enabled: true,
      strategy: 'on',
      model: 'gpt-4.1-mini'
    }
  })
});

const data = await response.json();
console.log(data.choices[0].message.content);

Method 2: Request Headers

Add compression headers to individual requests. Useful for cURL or custom HTTP clients. Works with all supported endpoints.

curl https://api.apertis.ai/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Context-Compression: on" \
  -H "X-Compression-Threshold: 8000" \
  -H "X-Compression-Keep-Turns: 6" \
  -H "X-Compression-Model: auto" \
  -d '{
    "model": "gpt-4.1-mini",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

The same headers work with /v1/responses:

curl https://api.apertis.ai/v1/responses \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Context-Compression: aggressive" \
  -H "X-Compression-Model: gpt-4.1-mini" \
  -d '{
    "model": "o4-mini",
    "input": [{"role": "user", "content": "Hello!"}]
  }'

| Header | Values | Description |
| --- | --- | --- |
| X-Context-Compression | on, conservative, aggressive, off | Compression strategy |
| X-Compression-Threshold | integer (e.g., 8000) | Token count threshold to trigger compression |
| X-Compression-Keep-Turns | integer (e.g., 6) | Number of recent turns to always preserve |
| X-Compression-Model | model ID or auto | Model used for summarization |
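If you set these headers from code rather than cURL, a small helper keeps them consistent. The helper and its validation are illustrative; only the header names and values come from the table above. With the OpenAI Python SDK, the resulting dict can be passed as default_headers when constructing the client.

```python
# Hypothetical helper; header names and values are those documented above.

VALID_STRATEGIES = {"on", "conservative", "aggressive", "off"}

def compression_headers(strategy="on", threshold=8000, keep_turns=6, model="auto"):
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return {
        "X-Context-Compression": strategy,
        "X-Compression-Threshold": str(threshold),
        "X-Compression-Keep-Turns": str(keep_turns),
        "X-Compression-Model": model,
    }
```
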

Method 3: Token-Level Defaults

Configure compression defaults on your API key via the Apertis dashboard. Go to API Keys → Edit → Compression tab.

This sets default compression behavior for all requests made with that key, without any code changes. Per-request settings (body or headers) override token-level defaults.
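Conceptually, the precedence chain behaves like a merge where higher-priority sources win. Whether Apertis merges field-by-field or takes the highest-priority source wholesale is not documented; this sketch assumes a per-field merge:

```python
from collections import ChainMap

def resolve_compression(body=None, headers=None, key_defaults=None):
    # Earlier sources win: request body > request headers > token-level
    # defaults. Per-field merging here is an assumption, not documented
    # Apertis behavior.
    sources = [s for s in (body, headers, key_defaults) if s]
    return dict(ChainMap(*sources)) if sources else None
```
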

Configuration Parameters

compression Object

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Enable or disable compression for this request |
| strategy | string | "on" | Compression strategy; controls how aggressively older messages are summarized (see Strategies below) |
| threshold | integer | 8000 | Minimum total token count before compression activates. Shorter conversations are sent to the model as-is, without any compression |
| keep_turns | integer | Per strategy | Number of most recent conversation turns to keep uncompressed; older turns are summarized into a condensed context. Set to 0 to use the strategy default |
| model | string | "auto" | Model used to generate conversation summaries. Use "auto" to let Apertis select a cost-efficient model, or specify a model ID from your available models |

Strategies

| Strategy | Keep Turns | Description |
| --- | --- | --- |
| on | 6 | Balanced; a good default for most use cases. Keeps 6 recent turns uncompressed, summarizes older history |
| conservative | 8 | Preserves more context; keeps 8 recent turns and compresses less aggressively. Best when recent conversation details matter |
| aggressive | 3 | Maximum token savings; keeps only 3 recent turns and summarizes everything else. Best for very long conversations where cost reduction is the priority |

Auto Model Selection

When model is set to "auto" (default), Apertis automatically selects a cost-efficient compression model based on the target model you are calling:

  • Claude models → claude-haiku-4.5
  • Gemini models → gemini-3-flash-preview
  • All other models (OpenAI, etc.) → gpt-4.1-mini

The auto-selection avoids using the same model for both compression and the target request (e.g., if you call claude-haiku-4.5, compression falls back to gpt-4.1-mini instead).
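The selection rules can be sketched as a small function. The prefix-based family detection and the self-collision fallback are assumptions drawn from the bullets above, not the gateway's actual code:

```python
# Illustrative sketch of auto model selection; real Apertis logic may differ.

def pick_compression_model(target_model):
    if target_model.startswith("claude"):
        choice = "claude-haiku-4.5"
    elif target_model.startswith("gemini"):
        choice = "gemini-3-flash-preview"
    else:
        choice = "gpt-4.1-mini"
    if choice == target_model:
        # Avoid using the target model to compress its own conversation.
        # (Behavior when the target is gpt-4.1-mini itself is not documented.)
        choice = "gpt-4.1-mini"
    return choice
```
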

You can also specify any available model explicitly (e.g., "gpt-4.1-mini", "claude-haiku-4.5", "gemini-3-flash-preview"). You can browse available models on the dashboard's Model Detail page or select from the dropdown in the API Key Compression tab.

Response Headers

When compression is enabled, Apertis adds response headers to indicate compression status:

When Compression is Applied

| Header | Example | Description |
| --- | --- | --- |
| X-Compression-Applied | true | Compression was applied |
| X-Compression-Original-Tokens | 50000 | Original token count before compression |
| X-Compression-Final-Tokens | 8000 | Token count after compression |
| X-Compression-Savings | 84% | Percentage of tokens saved |

When Compression is Skipped

| Header | Example | Description |
| --- | --- | --- |
| X-Compression-Applied | false | Compression was not applied |
| X-Compression-Error | compression-not-cost-effective | Machine-readable reason |
| X-Compression-Message | (human-readable explanation) | Detailed explanation |
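For logging, these status headers can be folded into a single line client-side. The helper below is illustrative; with the OpenAI Python SDK, raw response headers are reachable via the with_raw_response variant of each call:

```python
# Hypothetical log helper for the compression status headers above.

def compression_report(headers):
    if headers.get("X-Compression-Applied") == "true":
        original = int(headers["X-Compression-Original-Tokens"])
        final = int(headers["X-Compression-Final-Tokens"])
        saved = headers.get("X-Compression-Savings", "?")
        return f"compressed: {original} -> {final} tokens ({saved} saved)"
    reason = headers.get("X-Compression-Error", "unknown")
    return f"not compressed: {reason}"
```
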

Error Codes

| Error Code | Meaning |
| --- | --- |
| compression-not-cost-effective | The cost of running the compression model exceeds the estimated token savings. Common with short conversations |
| compression-model-unavailable:<model> | The specified compression model is not available. Use "auto" or check the model name |
| compression-call-failed | The compression model call failed. The original request proceeds without compression |

Multimodal Safety

Compression automatically protects multimodal content:

  • Messages containing images, audio, or documents are moved to the "recent" segment and are never compressed
  • Only text-based messages in the compressible history are summarized
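A minimal sketch of that rule, assuming OpenAI-style content parts where multimodal parts carry "type" values like "image_url", "input_audio", or "file" (the exact part types Apertis checks are not documented):

```python
# Part types treated as multimodal in this sketch; assumed, not exhaustive.
MULTIMODAL_TYPES = {"image_url", "input_audio", "file"}

def is_compressible(message):
    # String content is plain text; list content may contain multimodal parts.
    content = message["content"]
    if isinstance(content, str):
        return True
    return not any(part.get("type") in MULTIMODAL_TYPES for part in content)
```

Messages for which this returns False would be routed to the always-preserved "recent" segment.
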

Cost Considerations

Compression has its own cost (the summarization call), so Apertis performs an automatic cost-effectiveness check before compressing:

  • If the estimated savings from compression outweigh the cost of the summarization call, compression proceeds
  • If compression would cost more than it saves (common with short conversations), it is skipped automatically
  • The X-Compression-Error: compression-not-cost-effective header indicates when this happens
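A back-of-the-envelope version of that check, with made-up per-token prices (the real estimate is internal to Apertis):

```python
# Toy cost-effectiveness check; prices are hypothetical USD-per-token rates.

def compression_worthwhile(history_tokens, summary_tokens,
                           target_price, compressor_price):
    # Savings: the target model reads the short summary instead of the
    # full compressible history.
    savings = (history_tokens - summary_tokens) * target_price
    # Cost: the compression model reads the history and writes the summary.
    cost = (history_tokens + summary_tokens) * compressor_price
    return savings > cost
```

With a long history and an expensive target model the savings dominate; with a short history or a cheap target, the summarization call costs more than it saves, which is exactly the compression-not-cost-effective case.
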

When to Enable Compression

Context compression is most effective for:

  • Long-running conversations (20+ turns)
  • Expensive target models (e.g., GPT-4, Claude Opus) where token savings are significant
  • Chatbot applications where conversations accumulate over time

It is less effective for:

  • Short conversations (< 10 turns)
  • Cheap models where the compression cost may exceed savings
  • Single-turn requests

Supported Endpoints

| Endpoint | Supported |
| --- | --- |
| /v1/chat/completions | Yes |
| /v1/messages | Yes |
| /v1/responses | Yes |
| /v1/images/generations | No |
| /v1/audio/* | No |
| /v1/embeddings | No |

Graceful Degradation

Compression never blocks your main request. If compression fails for any reason:

  • The original uncompressed conversation is forwarded to the target model
  • The response is returned normally (HTTP 200)
  • Compression status is communicated via response headers only

Your application does not need special error handling for compression failures.