LXGIC-Studios/ai-tokenizer

ai-tokenizer

Token counting and management for LLMs. Count tokens, truncate and chunk text, and manage token budgets across OpenAI, Anthropic, Gemini, Mistral, and Llama models.

Quick Start

npx ai-tokenizer count "Hello, world!"

Features

  • Accurate token counting - Uses tiktoken for precise counts
  • Smart truncation - Truncate from end, start, or middle
  • Chunking - Split text by tokens, sentences, or paragraphs
  • Budget management - Allocate tokens across system/context/response
  • Statistics - Analyze compression ratios and token usage
  • Multi-model - GPT-4, Claude, Gemini, Llama, Mistral support

Installation

# Use directly with npx (no install needed)
npx ai-tokenizer count "your text here"

# Or install globally
npm install -g ai-tokenizer

# Or add to your project
npm install ai-tokenizer

CLI Usage

Count Tokens

# Count tokens in text
npx ai-tokenizer count "Hello, how are you?"

# Count tokens in a file
npx ai-tokenizer count --file ./document.txt

# Specify model
npx ai-tokenizer count "Hello" --model gpt-3.5-turbo

# Count message tokens (JSON format)
npx ai-tokenizer count --messages '[{"role":"user","content":"Hi"}]'

Truncate Text

# Truncate to 100 tokens
npx ai-tokenizer truncate "long text..." --tokens 100

# Truncate from start
npx ai-tokenizer truncate "long text..." --tokens 100 --strategy start

# Truncate from middle
npx ai-tokenizer truncate "long text..." --tokens 100 --strategy middle

# Custom ellipsis
npx ai-tokenizer truncate "long text..." --tokens 100 --ellipsis " [...]"

Chunk Text

# Split into 1000-token chunks
npx ai-tokenizer chunk --file ./large-doc.txt --tokens 1000

# With overlap
npx ai-tokenizer chunk --file ./doc.txt --tokens 1000 --overlap 100

# Save chunks to files
npx ai-tokenizer chunk --file ./doc.txt --tokens 1000 --output ./chunks/part

Analyze Text

# Get token statistics
npx ai-tokenizer analyze "Your text here"

# Analyze a file
npx ai-tokenizer analyze --file ./document.txt

Compare Models

# Compare token counts across models
npx ai-tokenizer compare "Your text" --models gpt-4,claude-3-sonnet,llama-2-70b

List Models

# Show all supported models and context windows
npx ai-tokenizer models

Programmatic Usage

import {
  countTokens,
  countMessageTokens,
  truncateToTokens,
  chunkText,
  analyzeText,
  BudgetManager,
  getContextWindow,
} from 'ai-tokenizer';

// Count tokens
const tokens = countTokens("Hello, world!", "gpt-4");
console.log(tokens); // 4

// Count message tokens
const messageTokens = countMessageTokens([
  { role: "system", content: "You are helpful." },
  { role: "user", content: "Hi!" },
], "gpt-4");

// Truncate text
const truncated = truncateToTokens("very long text...", {
  maxTokens: 100,
  strategy: "end",
  ellipsis: "...",
});

// Chunk text (longDocument is a string you have loaded elsewhere, e.g. a file's contents)
const chunks = chunkText(longDocument, {
  maxTokens: 1000,
  overlap: 100,
});

// Budget management
const budget = new BudgetManager(8000, "gpt-4");
budget.addSystemPrompt("You are a helpful assistant.");
budget.addContext(relevantDocs); // relevantDocs: context you have gathered elsewhere
console.log(budget.getRemainingContext()); // tokens left for more context
console.log(budget.getMaxResponseTokens()); // tokens reserved for response

// Analyze text
const stats = analyzeText("Your text here");
console.log(stats.totalTokens);
console.log(stats.compressionRatio);

// Get context window
const contextWindow = getContextWindow("gpt-4-turbo"); // 128000

Supported Models

| Model | Context Window (tokens) |
| --- | --- |
| gpt-4-turbo | 128,000 |
| gpt-4 | 8,192 |
| gpt-4-32k | 32,768 |
| gpt-3.5-turbo | 16,385 |
| claude-3-opus | 200,000 |
| claude-3-sonnet | 200,000 |
| claude-3-haiku | 200,000 |
| gemini-1.5-pro | 1,000,000 |
| mistral-large | 32,000 |
| llama-2-70b | 4,096 |

API Reference

countTokens(text, model?)

Count the tokens in a text string.

countMessageTokens(messages, model?)

Count the tokens in an array of chat messages, including the per-message formatting overhead.
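
To illustrate what "overhead" means here, the following is a minimal sketch, not the library's implementation. It assumes the per-message constants from OpenAI's published guidance for gpt-3.5-turbo and gpt-4 chat formatting (3 tokens per message, 1 extra when a name is present, 3 to prime the reply); ai-tokenizer may use different values. The wordCount helper is a hypothetical stand-in for a real tokenizer, and estimateMessageTokens is an illustrative name.

```typescript
// Hypothetical stand-in tokenizer: one token per whitespace-separated word.
// A real count would come from a BPE tokenizer such as tiktoken.
const wordCount = (text: string): number =>
  text.trim().split(/\s+/).filter(Boolean).length;

interface ChatMessage {
  role: string;
  content: string;
  name?: string;
}

// Assumed constants (OpenAI guidance for gpt-3.5-turbo / gpt-4 chat format).
const TOKENS_PER_MESSAGE = 3;
const TOKENS_PER_NAME = 1;
const REPLY_PRIMING = 3;

// Sum the content tokens plus the fixed formatting cost of each message.
function estimateMessageTokens(messages: ChatMessage[]): number {
  let total = REPLY_PRIMING;
  for (const m of messages) {
    total += TOKENS_PER_MESSAGE + wordCount(m.role) + wordCount(m.content);
    if (m.name !== undefined) {
      total += TOKENS_PER_NAME + wordCount(m.name);
    }
  }
  return total;
}

const estimate = estimateMessageTokens([
  { role: "system", content: "You are helpful." },
  { role: "user", content: "Hi!" },
]);
console.log(estimate); // 15 with the word-based stand-in
```

The point of the sketch is that a two-message conversation costs more than the sum of its text: here 6 word-tokens of role and content become 15 once formatting is included.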

truncateToTokens(text, options)

Truncate text to fit within token limit.
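
The three strategies can be pictured with a small sketch. This is an illustration under stated assumptions, not the library's code: it assumes "end" drops the tail, "start" drops the head, and "middle" keeps both ends, and it uses a hypothetical whitespace tokenizer in place of tiktoken. The ellipsis is not counted against the token limit in this sketch.

```typescript
type Strategy = "end" | "start" | "middle";

// Hypothetical stand-in tokenizer: whitespace-separated words.
const toTokens = (text: string): string[] =>
  text.trim().split(/\s+/).filter(Boolean);

function truncateSketch(
  text: string,
  maxTokens: number,
  strategy: Strategy = "end",
  ellipsis: string = "...",
): string {
  const tokens = toTokens(text);
  if (tokens.length <= maxTokens) return text; // already fits, leave untouched

  if (strategy === "end") {
    // Keep the beginning, cut the end.
    return tokens.slice(0, maxTokens).join(" ") + ellipsis;
  }
  if (strategy === "start") {
    // Cut the beginning, keep the end.
    return ellipsis + tokens.slice(tokens.length - maxTokens).join(" ");
  }
  // "middle": keep the first half and the last half of the allowance.
  const head = Math.ceil(maxTokens / 2);
  const tail = maxTokens - head;
  const tailTokens = tail > 0 ? tokens.slice(tokens.length - tail) : [];
  return tokens.slice(0, head).join(" ") + ellipsis + tailTokens.join(" ");
}

const sample = "one two three four five six";
console.log(truncateSketch(sample, 3));                      // "one two three..."
console.log(truncateSketch(sample, 3, "start"));             // "...four five six"
console.log(truncateSketch(sample, 4, "middle", " [...] ")); // "one two [...] five six"
```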

chunkText(text, options)

Split text into chunks of specified token size.
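
A minimal sketch of fixed-size chunking with overlap, assuming a sliding window that advances by maxTokens minus overlap on each step. The function name chunkSketch and the whitespace tokenizer are hypothetical stand-ins; the library's actual windowing rules may differ.

```typescript
// Split text into windows of at most maxTokens tokens, where each window
// repeats the last `overlap` tokens of the previous one.
function chunkSketch(text: string, maxTokens: number, overlap: number = 0): string[] {
  if (maxTokens <= 0) {
    throw new RangeError("maxTokens must be positive");
  }
  if (overlap < 0 || overlap >= maxTokens) {
    throw new RangeError("overlap must be in [0, maxTokens)");
  }

  // Hypothetical stand-in tokenizer: whitespace-separated words.
  const tokens = text.trim().split(/\s+/).filter(Boolean);
  const step = maxTokens - overlap;
  const chunks: string[] = [];

  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + maxTokens).join(" "));
    if (start + maxTokens >= tokens.length) break; // this window reached the end
  }
  return chunks;
}

console.log(chunkSketch("a b c d e f g", 3, 1)); // ["a b c", "c d e", "e f g"]
```

Overlap trades extra tokens for continuity: each chunk carries a little of the previous one, which helps when chunks are processed independently.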

chunkBySentence(text, maxTokens, model?)

Split text by sentences, respecting token limit.

chunkByParagraph(text, maxTokens, model?)

Split text by paragraphs, respecting token limit.
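
Sentence and paragraph chunking can both be viewed as the same greedy packing step applied to different units. The sketch below shows that idea under assumptions: the splitters are deliberately naive, the tokenizer is a hypothetical word counter, and packUnits is an illustrative name rather than part of the library. A unit that is larger than the limit on its own is emitted as its own oversized chunk.

```typescript
// Hypothetical stand-in tokenizer: one token per whitespace-separated word.
const countWords = (text: string): number =>
  text.trim().split(/\s+/).filter(Boolean).length;

// Naive splitters, for illustration only.
const splitSentences = (text: string): string[] =>
  (text.match(/[^.!?]+[.!?]*/g) ?? []).map((s) => s.trim()).filter(Boolean);
const splitParagraphs = (text: string): string[] =>
  text.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);

// Greedily pack whole units into chunks without exceeding maxTokens.
function packUnits(units: string[], maxTokens: number, joiner: string): string[] {
  const chunks: string[] = [];
  let current: string[] = [];
  let used = 0;

  for (const unit of units) {
    const cost = countWords(unit);
    if (current.length > 0 && used + cost > maxTokens) {
      chunks.push(current.join(joiner)); // flush before the limit is exceeded
      current = [];
      used = 0;
    }
    current.push(unit);
    used += cost;
  }
  if (current.length > 0) chunks.push(current.join(joiner));
  return chunks;
}

const bySentence = packUnits(splitSentences("One two. Three four five! Six?"), 5, " ");
console.log(bySentence); // ["One two. Three four five!", "Six?"]

const byParagraph = packUnits(
  splitParagraphs("Alpha beta.\n\nGamma delta epsilon.\n\nZeta."),
  3,
  "\n\n",
);
console.log(byParagraph.length); // 3
```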

analyzeText(text, model?)

Get token statistics for text.
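
The README does not define compressionRatio. One common reading is characters per token, and the sketch below assumes that definition; the library may compute it differently. The TextStats shape, the analyzeSketch name, and the whitespace tokenizer are all hypothetical.

```typescript
interface TextStats {
  totalTokens: number;
  totalCharacters: number;
  compressionRatio: number; // assumed here: characters per token
}

function analyzeSketch(text: string): TextStats {
  // Hypothetical stand-in tokenizer: whitespace-separated words.
  const totalTokens = text.trim().split(/\s+/).filter(Boolean).length;
  const totalCharacters = text.length;
  return {
    totalTokens,
    totalCharacters,
    // Guard against division by zero for empty input.
    compressionRatio: totalTokens === 0 ? 0 : totalCharacters / totalTokens,
  };
}

const sketchStats = analyzeSketch("Your text here");
console.log(sketchStats.totalTokens);     // 3
console.log(sketchStats.totalCharacters); // 14
```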

BudgetManager

Class for managing token budgets across system/context/response.
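
The arithmetic behind a token budget is simple: whatever the system prompt and context consume, plus the tokens held back for the response, must stay within the total. The class below is a hypothetical sketch of that bookkeeping and does not mirror the real BudgetManager API; its names, its constructor arguments, and the word-based tokenizer are assumptions.

```typescript
// Hypothetical stand-in tokenizer: one token per whitespace-separated word.
const budgetWordCount = (text: string): number =>
  text.trim().split(/\s+/).filter(Boolean).length;

class BudgetSketch {
  private readonly total: number;
  private readonly responseReserve: number;
  private used = 0;

  constructor(total: number, responseReserve: number) {
    if (responseReserve < 0 || responseReserve > total) {
      throw new RangeError("responseReserve must be between 0 and total");
    }
    this.total = total;
    this.responseReserve = responseReserve;
  }

  // Tokens still available for prompt or context text.
  remaining(): number {
    return this.total - this.responseReserve - this.used;
  }

  // Add text only if it fits; report whether it was accepted.
  add(text: string): boolean {
    const cost = budgetWordCount(text);
    if (cost > this.remaining()) return false;
    this.used += cost;
    return true;
  }
}

const sketchBudget = new BudgetSketch(20, 8);
sketchBudget.add("You are a helpful assistant."); // costs 5 word-tokens
console.log(sketchBudget.remaining()); // 7
```

Rejecting text that does not fit, rather than truncating it silently, keeps the caller in control of what gets dropped.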

getContextWindow(model)

Get context window size for a model.

fitsInContext(text, model, reserveTokens?)

Check if text fits in model's context window.
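
A rough sketch of the check, assuming it reduces to "token count plus reserved tokens is at most the model's context window". The window sizes are copied from the Supported Models table above; the fitsSketch name and the word-based tokenizer are hypothetical, so real results will differ from a tiktoken-based count.

```typescript
// Context windows for a few models, taken from the Supported Models table.
const CONTEXT_WINDOWS: Record<string, number> = {
  "gpt-4-turbo": 128000,
  "gpt-4": 8192,
  "llama-2-70b": 4096,
};

function fitsSketch(text: string, model: string, reserveTokens: number = 0): boolean {
  const limit = CONTEXT_WINDOWS[model];
  if (limit === undefined) {
    throw new Error("Unknown model: " + model);
  }
  // Hypothetical stand-in tokenizer: whitespace-separated words.
  const tokens = text.trim().split(/\s+/).filter(Boolean).length;
  return tokens + reserveTokens <= limit;
}

const fiveThousandWords = "word ".repeat(5000);
console.log(fitsSketch(fiveThousandWords, "llama-2-70b")); // false
console.log(fitsSketch(fiveThousandWords, "gpt-4"));       // true
console.log(fitsSketch(fiveThousandWords, "gpt-4", 4000)); // false
```

Reserving tokens matters because the context window is shared between the prompt and the model's reply.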

Part of the LXGIC Dev Toolkit

One of 110+ free developer tools from LXGIC Studios. No paywalls, no sign-ups.

License

MIT. Free forever.
