
Context Compression

POST /v1/chat/completions
POST /v1/messages
POST /v1/responses

Context compression automatically summarizes older conversation history to reduce token usage and cost, while preserving recent context. This is especially useful for long-running conversations where token counts grow significantly.

How It Works

When enabled, Apertis analyzes the conversation before forwarding it to the target model:

  1. Token threshold check — Compression only triggers when the conversation exceeds a configurable token threshold
  2. Message segmentation — Messages are split into system prompts, compressible history, and recent turns (which are always preserved)
  3. Cost-effectiveness check — Compression is skipped if the cost of running the compression model exceeds the estimated savings
  4. Summarization — Older messages are summarized by a lightweight model (e.g., gpt-4.1-mini, claude-haiku-4.5, or gemini-3-flash-preview)
  5. Injection — The summary replaces the older messages, and recent turns are preserved verbatim
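The five steps above can be sketched end to end. Everything below — the word-count tokenizer, the toy summarizer, the simplified cost check, and the turn-counting rule — is an illustrative assumption, not the actual Apertis implementation:

```python
def count_tokens(message):
    # Stand-in tokenizer: one token per whitespace-separated word.
    return len(message["content"].split())

def summarize(messages):
    # Stand-in for the lightweight summarization model call: keeps the
    # first few words of each older message.
    parts = [" ".join(m["content"].split()[:3]) for m in messages]
    return "Summary of earlier conversation: " + "; ".join(parts)

def compress_conversation(messages, threshold=8000, keep_turns=6):
    # 1. Token threshold check
    if sum(count_tokens(m) for m in messages) <= threshold:
        return messages
    # 2. Segment into system prompts, compressible history, recent turns
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]
    recent = history[-keep_turns * 2:]  # one turn = user + assistant message
    older = history[:-keep_turns * 2]
    if not older:
        return messages
    # 3. Cost-effectiveness check (toy version: skip unless the summary
    #    is actually shorter than the history it replaces)
    summary_text = summarize(older)  # 4. Summarization
    if len(summary_text.split()) >= sum(count_tokens(m) for m in older):
        return messages
    # 5. Injection: the summary replaces the older messages
    return system + [{"role": "system", "content": summary_text}] + recent
```

In the real gateway, steps 3 and 4 involve an actual model call and price-aware estimates; the shape of the flow is the same.
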

Transparent to Your Application

Compression happens at the gateway level. Your application sends the full conversation as usual — Apertis handles compression automatically and returns the response as normal. The only visible difference is reduced token usage.

Enabling Compression

There are three ways to enable context compression, with the following priority order:

Request body > Request headers > Token-level defaults

Method 1: Request Body

Add a compression object to your request body. Both the OpenAI and Anthropic Python SDKs pass it through via their extra_body parameter.

OpenAI SDK (Python)

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.apertis.ai/v1"
)

response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
    extra_body={
        "compression": {
            "enabled": True,
            "strategy": "on",    # "on", "conservative", or "aggressive"
            "threshold": 8000,   # Token threshold to trigger compression
            "keep_turns": 6,     # Recent turns to always preserve
            "model": "auto"      # Compression model ("auto" or a specific model)
        }
    }
)

print(response.choices[0].message.content)

Anthropic SDK (Python)

import anthropic

client = anthropic.Anthropic(
    api_key="YOUR_API_KEY",
    base_url="https://api.apertis.ai"
)

message = client.messages.create(
    model="claude-sonnet-4.5",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
    extra_body={
        "compression": {
            "enabled": True,
            "strategy": "on",
            "model": "gpt-4.1-mini"
        }
    }
)

print(message.content[0].text)

cURL

curl https://api.apertis.ai/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4.1-mini",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "compression": {
      "enabled": true,
      "strategy": "on",
      "model": "gpt-4.1-mini"
    }
  }'

OpenAI SDK — Responses API (Python)

Compression also works with the /v1/responses endpoint. The input field (string or array of messages) is automatically converted for compression.

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.apertis.ai/v1"
)

response = client.responses.create(
    model="o4-mini",
    input=[
        {"role": "user", "content": "Explain distributed systems"},
        {"role": "assistant", "content": "Distributed systems are..."},
        # ... long conversation history ...
        {"role": "user", "content": "Summarize the key points"}
    ],
    extra_body={
        "compression": {
            "enabled": True,
            "strategy": "aggressive",
            "model": "gpt-4.1-mini"
        }
    }
)

print(response.output[0].content[0].text)

cURL — Responses API

curl https://api.apertis.ai/v1/responses \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "o4-mini",
    "input": [
      {"role": "user", "content": "Hello!"}
    ],
    "compression": {
      "enabled": true,
      "strategy": "on",
      "model": "gpt-4.1-mini"
    }
  }'

Node.js / TypeScript

const response = await fetch('https://api.apertis.ai/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'gpt-4.1-mini',
    messages: [
      { role: 'user', content: 'Hello!' }
    ],
    compression: {
      enabled: true,
      strategy: 'on',
      model: 'gpt-4.1-mini'
    }
  })
});

const data = await response.json();
console.log(data.choices[0].message.content);

Method 2: Request Headers

Add compression headers to individual requests. Useful for cURL or custom HTTP clients. Works with all supported endpoints.

curl https://api.apertis.ai/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Context-Compression: on" \
  -H "X-Compression-Threshold: 8000" \
  -H "X-Compression-Keep-Turns: 6" \
  -H "X-Compression-Model: auto" \
  -d '{
    "model": "gpt-4.1-mini",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

The same headers work with /v1/responses:

curl https://api.apertis.ai/v1/responses \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Context-Compression: aggressive" \
  -H "X-Compression-Model: gpt-4.1-mini" \
  -d '{
    "model": "o4-mini",
    "input": [{"role": "user", "content": "Hello!"}]
  }'

| Header | Values | Description |
| --- | --- | --- |
| X-Context-Compression | on, conservative, aggressive, off | Compression strategy |
| X-Compression-Threshold | integer (e.g., 8000) | Token count threshold to trigger compression |
| X-Compression-Keep-Turns | integer (e.g., 6) | Number of recent turns to always preserve |
| X-Compression-Model | model ID or auto | Model used for summarization |
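If you set these headers from code rather than cURL, a small helper keeps them consistent. The helper and its validation are illustrative; only the header names and values come from the table above. With the OpenAI Python SDK, the resulting dict can be passed as default_headers when constructing the client.

```python
# Hypothetical helper; header names and values are those documented above.

VALID_STRATEGIES = {"on", "conservative", "aggressive", "off"}

def compression_headers(strategy="on", threshold=8000, keep_turns=6, model="auto"):
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return {
        "X-Context-Compression": strategy,
        "X-Compression-Threshold": str(threshold),
        "X-Compression-Keep-Turns": str(keep_turns),
        "X-Compression-Model": model,
    }
```
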

Method 3: Token-Level Defaults

Configure compression defaults on your API key via the Apertis dashboard. Go to API Keys → Edit → Compression tab.

This sets default compression behavior for all requests made with that key, without any code changes. Per-request settings (body or headers) override token-level defaults.
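Conceptually, the precedence chain behaves like a merge where higher-priority sources win. Whether Apertis merges field-by-field or takes the highest-priority source wholesale is not documented; this sketch assumes a per-field merge:

```python
from collections import ChainMap

def resolve_compression(body=None, headers=None, key_defaults=None):
    # Earlier sources win: request body > request headers > token-level
    # defaults. Per-field merging here is an assumption, not documented
    # Apertis behavior.
    sources = [s for s in (body, headers, key_defaults) if s]
    return dict(ChainMap(*sources)) if sources else None
```
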

Configuration Parameters

compression Object

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Enable or disable compression for this request |
| strategy | string | "on" | Compression strategy; controls how aggressively older messages are summarized (see Strategies below) |
| threshold | integer | 8000 | Minimum total token count before compression activates. Shorter conversations are sent to the model as-is, without any compression |
| keep_turns | integer | Per strategy | Number of most recent conversation turns to keep uncompressed; older turns are summarized into a condensed context. Set to 0 to use the strategy default |
| model | string | "auto" | Model used to generate conversation summaries. Use "auto" to let Apertis select a cost-efficient model, or specify a model ID from your available models |

Strategies

| Strategy | Keep Turns | Description |
| --- | --- | --- |
| on | 6 | Balanced; a good default for most use cases. Keeps 6 recent turns uncompressed, summarizes older history |
| conservative | 8 | Preserves more context; keeps 8 recent turns and compresses less aggressively. Best when recent conversation details matter |
| aggressive | 3 | Maximum token savings; keeps only 3 recent turns and summarizes everything else. Best for very long conversations where cost reduction is the priority |

Auto Model Selection

When model is set to "auto" (default), Apertis automatically selects a cost-efficient compression model based on the target model you are calling:

  • Claude models → claude-haiku-4.5
  • Gemini models → gemini-3-flash-preview
  • All other models (OpenAI, etc.) → gpt-4.1-mini

The auto-selection avoids using the same model for both compression and the target request (e.g., if you call claude-haiku-4.5, compression falls back to gpt-4.1-mini instead).
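The selection rules can be sketched as a small function. The prefix-based family detection and the self-collision fallback are assumptions drawn from the bullets above, not the gateway's actual code:

```python
# Illustrative sketch of auto model selection; real Apertis logic may differ.

def pick_compression_model(target_model):
    if target_model.startswith("claude"):
        choice = "claude-haiku-4.5"
    elif target_model.startswith("gemini"):
        choice = "gemini-3-flash-preview"
    else:
        choice = "gpt-4.1-mini"
    if choice == target_model:
        # Avoid using the target model to compress its own conversation.
        # (Behavior when the target is gpt-4.1-mini itself is not documented.)
        choice = "gpt-4.1-mini"
    return choice
```
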

You can also specify any available model explicitly (e.g., "gpt-4.1-mini", "claude-haiku-4.5", "gemini-3-flash-preview"). You can browse available models on the dashboard's Model Detail page or select from the dropdown in the API Key Compression tab.

Response Headers

When compression is enabled, Apertis adds response headers to indicate compression status:

When Compression is Applied

| Header | Example | Description |
| --- | --- | --- |
| X-Compression-Applied | true | Compression was applied |
| X-Compression-Original-Tokens | 50000 | Original token count before compression |
| X-Compression-Final-Tokens | 8000 | Token count after compression |
| X-Compression-Savings | 84% | Percentage of tokens saved |

When Compression is Skipped

| Header | Example | Description |
| --- | --- | --- |
| X-Compression-Applied | false | Compression was not applied |
| X-Compression-Error | compression-not-cost-effective | Machine-readable reason |
| X-Compression-Message | (human-readable explanation) | Detailed explanation |
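For logging, these status headers can be folded into a single line client-side. The helper below is illustrative; with the OpenAI Python SDK, raw response headers are reachable via the with_raw_response variant of each call:

```python
# Hypothetical log helper for the compression status headers above.

def compression_report(headers):
    if headers.get("X-Compression-Applied") == "true":
        original = int(headers["X-Compression-Original-Tokens"])
        final = int(headers["X-Compression-Final-Tokens"])
        saved = headers.get("X-Compression-Savings", "?")
        return f"compressed: {original} -> {final} tokens ({saved} saved)"
    reason = headers.get("X-Compression-Error", "unknown")
    return f"not compressed: {reason}"
```
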

Error Codes

| Error Code | Meaning |
| --- | --- |
| compression-not-cost-effective | The cost of running the compression model exceeds the estimated token savings. Common with short conversations |
| compression-model-unavailable:<model> | The specified compression model is not available. Use "auto" or check the model name |
| compression-call-failed | The compression model call failed. The original request proceeds without compression |

Multimodal Safety

Compression automatically protects multimodal content:

  • Messages containing images, audio, or documents are moved to the "recent" segment and are never compressed
  • Only text-based messages in the compressible history are summarized
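A minimal sketch of that rule, assuming OpenAI-style content parts where multimodal parts carry "type" values like "image_url", "input_audio", or "file" (the exact part types Apertis checks are not documented):

```python
# Part types treated as multimodal in this sketch; assumed, not exhaustive.
MULTIMODAL_TYPES = {"image_url", "input_audio", "file"}

def is_compressible(message):
    # String content is plain text; list content may contain multimodal parts.
    content = message["content"]
    if isinstance(content, str):
        return True
    return not any(part.get("type") in MULTIMODAL_TYPES for part in content)
```

Messages for which this returns False would be routed to the always-preserved "recent" segment.
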

Cost Considerations

Compression has its own cost (the summarization call), so Apertis performs an automatic cost-effectiveness check before compressing:

  • If the estimated savings from compression outweigh the cost of the summarization call, compression proceeds
  • If compression would cost more than it saves (common with short conversations), it is skipped automatically
  • The X-Compression-Error: compression-not-cost-effective header indicates when this happens
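A back-of-the-envelope version of that check, with made-up per-token prices (the real estimate is internal to Apertis):

```python
# Toy cost-effectiveness check; prices are hypothetical USD-per-token rates.

def compression_worthwhile(history_tokens, summary_tokens,
                           target_price, compressor_price):
    # Savings: the target model reads the short summary instead of the
    # full compressible history.
    savings = (history_tokens - summary_tokens) * target_price
    # Cost: the compression model reads the history and writes the summary.
    cost = (history_tokens + summary_tokens) * compressor_price
    return savings > cost
```

With a long history and an expensive target model the savings dominate; with a short history or a cheap target, the summarization call costs more than it saves, which is exactly the compression-not-cost-effective case.
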

When to Enable Compression

Context compression is most effective for:

  • Long-running conversations (20+ turns)
  • Expensive target models (e.g., GPT-4, Claude Opus) where token savings are significant
  • Chatbot applications where conversations accumulate over time

It is less effective for:

  • Short conversations (< 10 turns)
  • Cheap models where the compression cost may exceed savings
  • Single-turn requests

Supported Endpoints

| Endpoint | Supported |
| --- | --- |
| /v1/chat/completions | Yes |
| /v1/messages | Yes |
| /v1/responses | Yes |
| /v1/images/generations | No |
| /v1/audio/* | No |
| /v1/embeddings | No |

Graceful Degradation

Compression never blocks your main request. If compression fails for any reason:

  • The original uncompressed conversation is forwarded to the target model
  • The response is returned normally (HTTP 200)
  • Compression status is communicated via response headers only

Your application does not need special error handling for compression failures.