Token Reduction Layer

Provider-agnostic Node.js middleware for reducing LLM spend in long-running chat and copilot workflows.

This repo is most useful when your product has one or more of these traits:

  • long sessions
  • repeated or similar user asks
  • large system prompts
  • persistent brand, product, or workflow context

It is less useful for one-off stateless prompts. The win comes from session efficiency, not magic prompt compression.

What it does

Token Reduction Layer combines four practical levers:

  1. Lean message cleanup
  2. Intent-aware semantic cache
  3. Compact memory for long threads
  4. Provider-agnostic request packaging

It works with any model you can call from JavaScript through modelCaller(request): OpenAI, Anthropic, Gemini, Groq, Ollama, Together, local models, or your own gateway.

Realistic savings

These are the honest ranges for the current approach:

Workload                                 Typical savings
Short one-off prompts                    5-15%
Medium chat sessions                     15-35%
Long sessions with memory pressure       20-45%
Repetitive support/internal workflows    30-60%+

The biggest levers are usually:

  • shorter fixed prompt overhead
  • cache hit rate
  • memory compaction after history gets large

The smallest lever is usually word reordering inside a single short prompt.

Exact benchmark example

Run:

npm run bench

Current sample output using the gpt-4o tokenizer:

Scenario                                  Before   After   Saved
User prompt only                              44      39   11.4%
Old JSON wrapper vs raw payload               47      39   17.0%
Old system prompt vs new system prompt       191      25   86.9%
4-message request total                      313     139   55.6%
Long-thread memory compaction                265     235   11.3%

Important: summarizing a very short thread can cost more than it saves. This is why memory compaction should only kick in after history is genuinely large enough.
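A quick back-of-the-envelope with the sample numbers above shows why fixed prompt overhead dominates. The system prompt is replayed on every turn, so over a hypothetical 20-turn session:

  old system prompt: 191 tokens x 20 turns = 3,820 tokens
  new system prompt:  25 tokens x 20 turns =   500 tokens

That is 3,320 tokens saved from one lever, while compacting the sample long thread saves only 265 - 235 = 30 tokens per replay and still has to pay for the summarization call itself.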

How it works

1. Lean message cleanup

The compressor removes filler and shortens known long phrases without stripping critical literals.

Before: Could you please kindly explain how machine learning and deep learning work?
After:  explain how ML and DL work?

By default the layer now sends a raw compressed message, because raw is cheaper than wrapping every turn in JSON or intent tags.

Optional payload modes:

  • raw
  • tagged
  • json
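
Illustrative shapes of the same message under each mode (sketches, not the library's literal output):

raw:    explain how ML and DL work?
tagged: <intent:explain> explain how ML and DL work? </intent>
json:   {"intent": "explain", "message": "explain how ML and DL work?"}

The wrapper tokens in the tagged and json modes are paid on every turn, which is why raw is the default.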

2. Intent-aware semantic cache

The cache is safer than a plain text-similarity lookup. It can reject hits when:

  • intent changes
  • important literals conflict
  • quoted strings conflict
  • numbers conflict
  • keyword overlap is too weak

That makes it more believable for production use in workflows where colors, prices, names, and constraints matter.
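
The kind of guard this implies can be sketched in a few lines; the helpers below are naive stand-ins, not the library's internals:

// Naive stand-ins for the cache guards described above,
// not the library's actual implementation.
const numbersOf  = text => new Set(text.match(/\d+(?:\.\d+)?/g) || []);
const quotesOf   = text => new Set(text.match(/"[^"]*"/g) || []);
const keywordsOf = text =>
  new Set(text.toLowerCase().split(/\W+/).filter(w => w.length > 3));

const sameSet = (a, b) => a.size === b.size && [...a].every(x => b.has(x));

function acceptCacheHit(cachedPrompt, newPrompt, opts) {
  // reject when numbers or quoted strings conflict
  if (!sameSet(numbersOf(cachedPrompt), numbersOf(newPrompt))) return false;
  if (!sameSet(quotesOf(cachedPrompt), quotesOf(newPrompt))) return false;
  // reject when keyword overlap is too weak
  const shared = [...keywordsOf(cachedPrompt)]
    .filter(w => keywordsOf(newPrompt).has(w));
  return shared.length >= opts.cacheMinSharedKeywords;
}

// acceptCacheHit('Write 3 taglines for "Gadiva"',
//                'Write 5 taglines for "Gadiva"',
//                { cacheMinSharedKeywords: 2 })  // => false: 3 vs 5 conflict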

3. Compact thread memory

Long chats get reshaped into:

[anchor]
original brief

[facts]
brand=Gadiva Hair Oil
primary_color=#C9A84C
secondary_color=#1A1A1A
tagline=Rooted in Nature, Defined by You.

[recent]
last N raw turns

This preserves durable context while reducing the cost of replaying the middle of the conversation.
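
A minimal sketch of the reshaping step; compactHistory and its field names are assumptions, not the library's API:

// Hypothetical helper illustrating the reshape; not the library's API.
function compactHistory(anchor, facts, history, keepRecentMessages) {
  const recent = history.slice(-keepRecentMessages);
  const factLines = Object.entries(facts).map(([k, v]) => `${k}=${v}`);
  return [
    '[anchor]',
    anchor,
    '',
    '[facts]',
    ...factLines,
    '',
    '[recent]',
    ...recent.map(m => `${m.role}: ${m.content}`),
  ].join('\n');
}

Only the last keepRecentMessages turns (see Configuration) are replayed verbatim; everything older collapses into the anchor and fact sheet.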

4. Universal request handoff

The layer can either call the model directly through your own SDK wrapper or prepare the optimized request bundle first.

Quick start

git clone <repo>
cd token-reduction-layer
npm test
npm run bench

Any-model usage

const { TokenReductionLayer } = require('./src');

const trl = new TokenReductionLayer({
  provider: 'custom',
  model: 'gpt-4o-mini',
  // receives the optimized request bundle; must return the reply text
  modelCaller: async request => {
    return 'model reply here';
  },
});

(async () => {
  const result = await trl.chat(
    'Could you please write three launch taglines for Gadiva Hair Oil?'
  );

  console.log(result.reply);
  console.log(result.compressionStats);
})();

OpenAI example

See examples/openai-compatible.js.
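
As a hedged inline sketch of the same wiring with the official openai package; the { role, content } message shape is an assumption based on the prepareRequest() fields below:

const OpenAI = require('openai');
const { TokenReductionLayer } = require('./src');

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const trl = new TokenReductionLayer({
  provider: 'custom',
  model: 'gpt-4o-mini',
  modelCaller: async request => {
    // forward the optimized bundle in OpenAI chat format
    const completion = await client.chat.completions.create({
      model: request.model,
      max_tokens: request.maxTokens,
      messages: [
        { role: 'system', content: request.system },
        ...request.messages,
      ],
    });
    return completion.choices[0].message.content;
  },
});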

Anthropic example

Anthropic still works out of the box:

const { TokenReductionLayer } = require('./src');

const trl = new TokenReductionLayer({
  provider: 'anthropic',
  anthropicApiKey: process.env.ANTHROPIC_API_KEY,
  model: 'claude-sonnet-4-20250514',
});

(async () => {
  const result = await trl.chat('Explain retrieval augmented generation.');
  console.log(result.reply);
})();

prepareRequest()

If you want full control over the final API call:

const trl = new TokenReductionLayer({
  provider: 'custom',
  modelCaller: async () => 'ok',
});

(async () => {
  await trl.chat('Write a homepage headline');
  const request = trl.prepareRequest();
})();

request contains:

  • system
  • messages
  • model
  • maxTokens
  • provider
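
From there the bundle can go to any endpoint; a sketch against a placeholder HTTP gateway (the URL and body shape are illustrative, not a real contract):

// Placeholder endpoint and body shape; adapt to your gateway's contract.
async function sendToGateway(request) {
  const res = await fetch('https://your-gateway.example/v1/chat', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({
      model: request.model,
      max_tokens: request.maxTokens,
      system: request.system,
      messages: request.messages,
    }),
  });
  return res.json();
}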

Configuration

new TokenReductionLayer({
  provider: 'custom',
  model: 'gpt-4o-mini',
  maxTokens: 1024,

  modelCaller: async request => {},      // calls your model with the optimized request
  summarizerCaller: async request => {}, // optional separate caller for the summarization step

  // lean message cleanup
  compressionEnabled: true,
  payloadMode: 'raw',                  // 'raw' | 'tagged' | 'json'
  expandContractions: false,
  customAbbreviations: {
    'retrieval augmented generation': 'RAG',
    'customer relationship management': 'CRM',
  },

  // intent-aware semantic cache
  cacheEnabled: true,
  cacheThreshold: 0.55,
  cacheMinSharedKeywords: 2,
  cacheRequireIntentMatch: true,
  cacheMaxEntries: 500,
  cacheTtlMs: null,

  // compact thread memory
  summarizeEnabled: true,
  summarizeEveryNTurns: 10,
  keepRecentMessages: 8,
  factFormat: 'compact',
})

Where the savings usually come from

Lever                          Typical effect              Notes
Message cleanup                5-20% per message           Best on chatty prompts
Raw payload instead of JSON    5-15% vs wrapped payloads   Small but steady
Semantic cache                 100% on hits                Depends on repetition
Memory compaction              20-60% on long threads      Strongest once history grows
Combined session effect        20-45% typical              Higher on repetitive workloads

Files

token-reduction-layer/
|-- src/
|   |-- constants.js
|   |-- compressor.js
|   |-- cache.js
|   |-- summarizer.js
|   `-- index.js
|-- examples/
|   |-- basic.js
|   |-- express-middleware.js
|   `-- openai-compatible.js
|-- scripts/
|   `-- benchmark.js
|-- tests/
|   |-- compressor.test.js
|   |-- cache.test.js
|   `-- index.test.js
|-- calculator.html
|-- package.json
`-- README.md

Tests

npm test

License

MIT
