wasm-tokenizer

wasm-tokenizer is a high-performance tokenizer written in C++ and compiled to WebAssembly (WASM) for use in both browser and Node.js environments. It provides efficient encoding and decoding of tokens, making it the most performant tokenizer in its class.

Thanks

Thanks to Claude 3.5 Sonnet! Most of the work on this library was done by Anthropic's AI through GPTunneL, and the library is now used as the core of the service's token-counting functionality. 🤯

Features

  • Written in C++ and compiled to WebAssembly
  • Compatible with both browser and Node.js environments
  • Highly efficient token encoding and decoding
  • Includes a tool that converts the tiktoken file format to binary, reducing the size of the cl100k token database by 60%
  • Developed to speed up frontend token counting for GPTunneL, an AI aggregator by ScriptHeads

Performance

wasm-tokenizer outperforms other popular tokenizers such as gpt-tokenizer and tiktoken. Here are some performance comparisons:

1,000 Tokens

[benchmark chart: 1k tokens performance]

10,000 Tokens

[benchmark chart: 10k tokens performance]

1,500,000 Tokens

[benchmark chart: 1m tokens performance]

All tests were run on a MacBook Pro (M1 Pro) with Node.js v21.6.2, averaging each measurement over 1,000 iterations.
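
The benchmark harness itself is not included in this README; as an illustration only, a minimal average-over-iterations measurement in Node.js might look like the sketch below (the helper name is hypothetical):

// Illustrative micro-benchmark: average encode() latency over N runs.
function benchmarkEncode(tokenizer, text, iterations = 1000) {
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) {
    tokenizer.encode(text); // result discarded; only the call is timed
  }
  const elapsedNs = process.hrtime.bigint() - start;
  return Number(elapsedNs) / iterations / 1e6; // average milliseconds per call
}

Running the same helper against each tokenizer with identical input gives directly comparable averages.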

Usage

Node.js Environment

import WASMTokenizer from './WASMTokenizer'

// The factory returns a promise that resolves once the WASM module
// has been compiled and instantiated.
WASMTokenizer().then((tokenizer) => {
  const text = "Hello world";
  const length = tokenizer.count(text);     // token count only
  const tokens = tokenizer.encode(text);    // text -> token ids
  const decoded = tokenizer.decode(tokens); // token ids -> text

  console.log(decoded, length, 'tokens');
});
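
The wrapper above resolves directly to a ready-to-use tokenizer. If your build instead exposes the raw Embind module, as in the browser example below, you would load the token database yourself; a sketch assuming the same bindings (the entry-point path here is hypothetical):

import { readFileSync } from 'node:fs';
import wasmTokenizer from './tokenizer.js'; // hypothetical raw-module entry point

wasmTokenizer().then((TokenizerModule) => {
  // Read the token database and copy it into a WASM-side vector.
  const bytes = readFileSync('cl100k_base.bin');
  const vector = new TokenizerModule.VectorUint8();
  for (const b of bytes) vector.push_back(b);

  const tokenizer = new TokenizerModule.Tokenizer(vector);
  vector.delete(); // the vector is no longer needed once the tokenizer holds the data

  console.log(tokenizer.count('Hello world'), 'tokens');
});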

Browser Environment

Using wasm-tokenizer in a web environment requires loading the WebAssembly module and the token database. Here's an example of how to use it:

<head>
  <script src="tokenizer.js"></script>
</head>
<body>
  <script>
    fetch('cl100k_base.bin')
    .then(response => response.arrayBuffer())
    .then(arrayBuffer => {
      const uint8Array = new Uint8Array(arrayBuffer);
      wasmTokenizer().then((TokenizerModule) => {
        // Create a vector from the Uint8Array
        const vector = new TokenizerModule.VectorUint8();
        for (let i = 0; i < uint8Array.length; i++) {
          vector.push_back(uint8Array[i]);
        }

        // Create the tokenizer using the vector
        const tokenizer = new TokenizerModule.Tokenizer(vector);
        // Clean up the vector
        vector.delete();

        const text = "Hello world";
        const length = tokenizer.count(text);
        const tokens = tokenizer.encode(text);
        const decoded = tokenizer.decode(tokens);

        console.log(decoded, length, 'tokens');
      });
    });
  </script>
</body>
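
Note: Tokenizer is an Embind-wrapped C++ object, so, like the VectorUint8 above, it presumably owns memory on the WASM heap. Assuming the bindings follow the usual Embind convention (not confirmed by this README), it would be released when no longer needed:

// Assumption: Embind-wrapped objects expose delete(); verify against
// the actual bindings before relying on this.
tokenizer.delete(); // frees the tokenizer's WASM-side memory
// If encode() returns an Embind vector rather than a plain JS array,
// the returned tokens object would need tokens.delete() as well.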

Binary Conversion Tool

wasm-tokenizer includes a tool that converts the tiktoken file format to binary. This conversion reduces the size of the cl100k token database by 60%, further improving performance and reducing resource usage.
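
The repository's actual binary layout isn't documented in this README, but the savings are easy to motivate: a .tiktoken file stores each token as base64 text (about 33% larger than the raw bytes) plus an explicit rank, while a binary file can store the raw bytes directly and let the rank follow from position. A hypothetical Node.js converter along those lines (file names and layout are illustrative):

// Hypothetical converter: each .tiktoken line is "<base64 token> <rank>".
// This illustrative binary layout writes a 1-byte length prefix followed
// by the raw token bytes, with ranks implied by order; the real tool's
// format may differ.
import { readFileSync, writeFileSync } from 'node:fs';

const lines = readFileSync('cl100k_base.tiktoken', 'utf8')
  .split('\n')
  .filter(Boolean);

const chunks = [];
for (const line of lines) {
  const [b64] = line.split(' ');           // rank is implied by line order
  const bytes = Buffer.from(b64, 'base64');
  chunks.push(Buffer.from([bytes.length]), bytes);
}

writeFileSync('cl100k_base.bin', Buffer.concat(chunks));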

Real-world Application

You can see wasm-tokenizer in action on GPTunneL, where it powers token calculations for a variety of LLMs. Visit GPTunneL to experience the performance benefits of wasm-tokenizer and explore some of the most powerful language models available.

Contributing

We welcome contributions to wasm-tokenizer! If you'd like to contribute, please follow these steps:

  • Fork the repository
  • Create a new branch for your feature or bug fix
  • Make your changes and commit them with clear, descriptive messages
  • Push your changes to your fork
  • Submit a pull request to the main repository

TODO

  • npm package
  • add all databases
  • add chat encode support

License

This project is licensed under the MIT License.

Acknowledgements

wasm-tokenizer was developed by ScriptHeads for use in the GPTunneL AI Aggregator. We thank the open-source community for their continuous support and contributions.
