Token Reduction Layer

Provider-agnostic Node.js middleware for reducing LLM spend in long-running chat and copilot workflows.

This repo is most useful when your product has one or more of these traits:

  • long sessions
  • repeated or similar user asks
  • large system prompts
  • persistent brand, product, or workflow context

It is less useful for one-off stateless prompts. The win comes from session efficiency, not magic prompt compression.

What it does

Token Reduction Layer combines four practical levers:

  1. Lean message cleanup
  2. Intent-aware semantic cache
  3. Compact memory for long threads
  4. Provider-agnostic request packaging

It works with any model you can call from JavaScript through modelCaller(request): OpenAI, Anthropic, Gemini, Groq, Ollama, Together, local models, or your own gateway.

Realistic savings

These are the honest ranges for the current approach:

Workload                                 Typical savings
Short one-off prompts                    5-15%
Medium chat sessions                     15-35%
Long sessions with memory pressure       20-45%
Repetitive support/internal workflows    30-60%+

The biggest levers are usually:

  • shorter fixed prompt overhead
  • cache hit rate
  • memory compaction after history gets large

The smallest lever is usually word reordering inside a single short prompt.

Exact benchmark example

Run:

npm run bench

Current sample output using the gpt-4o tokenizer:

Scenario                                  Before   After   Saved
User prompt only                              44      39   11.4%
Old JSON wrapper vs raw payload               47      39   17.0%
Old system prompt vs new system prompt       191      25   86.9%
4-message request total                      313     139   55.6%
Long-thread memory compaction                265     235   11.3%

Important: summarizing a very short thread can cost more than it saves. This is why memory compaction should only kick in after history is genuinely large enough.
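A quick back-of-the-envelope with the sample numbers above shows why fixed prompt overhead dominates. The system prompt is replayed on every turn, so over a hypothetical 20-turn session:

  old system prompt: 191 tokens x 20 turns = 3,820 tokens
  new system prompt:  25 tokens x 20 turns =   500 tokens

That is 3,320 tokens saved from one lever, while compacting the sample long thread saves only 265 - 235 = 30 tokens per replay and still has to pay for the summarization call itself.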

How it works

1. Lean message cleanup

The compressor removes filler and shortens known long phrases without stripping critical literals.

Before: Could you please kindly explain how machine learning and deep learning work?
After:  explain how ML and DL work?

By default the layer now sends a raw compressed message, because raw is cheaper than wrapping every turn in JSON or intent tags.

Optional payload modes:

  • raw
  • tagged
  • json
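
Illustrative shapes of the same message under each mode (sketches, not the library's literal output):

raw:    explain how ML and DL work?
tagged: <intent:explain> explain how ML and DL work? </intent>
json:   {"intent": "explain", "message": "explain how ML and DL work?"}

The wrapper tokens in the tagged and json modes are paid on every turn, which is why raw is the default.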

2. Intent-aware semantic cache

The cache is safer than a plain text-similarity lookup. It can reject hits when:

  • intent changes
  • important literals conflict
  • quoted strings conflict
  • numbers conflict
  • keyword overlap is too weak

That makes it more believable for production use in workflows where colors, prices, names, and constraints matter.
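
The kind of guard this implies can be sketched in a few lines; the helpers below are naive stand-ins, not the library's internals:

// Naive stand-ins for the cache guards described above,
// not the library's actual implementation.
const numbersOf  = text => new Set(text.match(/\d+(?:\.\d+)?/g) || []);
const quotesOf   = text => new Set(text.match(/"[^"]*"/g) || []);
const keywordsOf = text =>
  new Set(text.toLowerCase().split(/\W+/).filter(w => w.length > 3));

const sameSet = (a, b) => a.size === b.size && [...a].every(x => b.has(x));

function acceptCacheHit(cachedPrompt, newPrompt, opts) {
  // reject when numbers or quoted strings conflict
  if (!sameSet(numbersOf(cachedPrompt), numbersOf(newPrompt))) return false;
  if (!sameSet(quotesOf(cachedPrompt), quotesOf(newPrompt))) return false;
  // reject when keyword overlap is too weak
  const shared = [...keywordsOf(cachedPrompt)]
    .filter(w => keywordsOf(newPrompt).has(w));
  return shared.length >= opts.cacheMinSharedKeywords;
}

// acceptCacheHit('Write 3 taglines for "Gadiva"',
//                'Write 5 taglines for "Gadiva"',
//                { cacheMinSharedKeywords: 2 })  // => false: 3 vs 5 conflict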

3. Compact thread memory

Long chats get reshaped into:

[anchor]
original brief

[facts]
brand=Gadiva Hair Oil
primary_color=#C9A84C
secondary_color=#1A1A1A
tagline=Rooted in Nature, Defined by You.

[recent]
last N raw turns

This preserves durable context while reducing the cost of replaying the middle of the conversation.
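
A minimal sketch of the reshaping step; compactHistory and its field names are assumptions, not the library's API:

// Hypothetical helper illustrating the reshape; not the library's API.
function compactHistory(anchor, facts, history, keepRecentMessages) {
  const recent = history.slice(-keepRecentMessages);
  const factLines = Object.entries(facts).map(([k, v]) => `${k}=${v}`);
  return [
    '[anchor]',
    anchor,
    '',
    '[facts]',
    ...factLines,
    '',
    '[recent]',
    ...recent.map(m => `${m.role}: ${m.content}`),
  ].join('\n');
}

Only the last keepRecentMessages turns (see Configuration) are replayed verbatim; everything older collapses into the anchor and fact sheet.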

4. Universal request handoff

The layer can either call the model directly through your own SDK wrapper or prepare the optimized request bundle first.

Quick start

git clone <repo>
cd token-reduction-layer
npm test
npm run bench

Any-model usage

const { TokenReductionLayer } = require('./src');

const trl = new TokenReductionLayer({
  provider: 'custom',
  model: 'gpt-4o-mini',
  // receives the optimized request bundle; must return the reply text
  modelCaller: async request => {
    return 'model reply here';
  },
});

(async () => {
  const result = await trl.chat(
    'Could you please write three launch taglines for Gadiva Hair Oil?'
  );

  console.log(result.reply);
  console.log(result.compressionStats);
})();

OpenAI example

See examples/openai-compatible.js.
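
As a hedged inline sketch of the same wiring with the official openai package; the { role, content } message shape is an assumption based on the prepareRequest() fields below:

const OpenAI = require('openai');
const { TokenReductionLayer } = require('./src');

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const trl = new TokenReductionLayer({
  provider: 'custom',
  model: 'gpt-4o-mini',
  modelCaller: async request => {
    // forward the optimized bundle in OpenAI chat format
    const completion = await client.chat.completions.create({
      model: request.model,
      max_tokens: request.maxTokens,
      messages: [
        { role: 'system', content: request.system },
        ...request.messages,
      ],
    });
    return completion.choices[0].message.content;
  },
});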

Anthropic example

Anthropic still works out of the box:

const { TokenReductionLayer } = require('./src');

const trl = new TokenReductionLayer({
  provider: 'anthropic',
  anthropicApiKey: process.env.ANTHROPIC_API_KEY,
  model: 'claude-sonnet-4-20250514',
});

(async () => {
  const result = await trl.chat('Explain retrieval augmented generation.');
  console.log(result.reply);
})();

prepareRequest()

If you want full control over the final API call:

const trl = new TokenReductionLayer({
  provider: 'custom',
  modelCaller: async () => 'ok',
});

(async () => {
  await trl.chat('Write a homepage headline');
  const request = trl.prepareRequest();
})();

request contains:

  • system
  • messages
  • model
  • maxTokens
  • provider
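
From there the bundle can go to any endpoint; a sketch against a placeholder HTTP gateway (the URL and body shape are illustrative, not a real contract):

// Placeholder endpoint and body shape; adapt to your gateway's contract.
async function sendToGateway(request) {
  const res = await fetch('https://your-gateway.example/v1/chat', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({
      model: request.model,
      max_tokens: request.maxTokens,
      system: request.system,
      messages: request.messages,
    }),
  });
  return res.json();
}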

Configuration

new TokenReductionLayer({
  provider: 'custom',
  model: 'gpt-4o-mini',
  maxTokens: 1024,

  modelCaller: async request => {},      // calls your model with the optimized request
  summarizerCaller: async request => {}, // optional separate caller for the summarization step

  // lean message cleanup
  compressionEnabled: true,
  payloadMode: 'raw',                  // 'raw' | 'tagged' | 'json'
  expandContractions: false,
  customAbbreviations: {
    'retrieval augmented generation': 'RAG',
    'customer relationship management': 'CRM',
  },

  // intent-aware semantic cache
  cacheEnabled: true,
  cacheThreshold: 0.55,
  cacheMinSharedKeywords: 2,
  cacheRequireIntentMatch: true,
  cacheMaxEntries: 500,
  cacheTtlMs: null,

  // compact thread memory
  summarizeEnabled: true,
  summarizeEveryNTurns: 10,
  keepRecentMessages: 8,
  factFormat: 'compact',
})

Where the savings usually come from

Lever                          Typical effect              Notes
Message cleanup                5-20% per message           Best on chatty prompts
Raw payload instead of JSON    5-15% vs wrapped payloads   Small but steady
Semantic cache                 100% on hits                Depends on repetition
Memory compaction              20-60% on long threads      Strongest once history grows
Combined session effect        20-45% typical              Higher on repetitive workloads

Files

token-reduction-layer/
|-- src/
|   |-- constants.js
|   |-- compressor.js
|   |-- cache.js
|   |-- summarizer.js
|   `-- index.js
|-- examples/
|   |-- basic.js
|   |-- express-middleware.js
|   `-- openai-compatible.js
|-- scripts/
|   `-- benchmark.js
|-- tests/
|   |-- compressor.test.js
|   |-- cache.test.js
|   `-- index.test.js
|-- calculator.html
|-- package.json
`-- README.md

Tests

npm test

License

MIT
