
tokenizer: full pipeline + benchmarks #4

Merged

gabewillen merged 9 commits into main from feature/tokenizer-full on Feb 22, 2026
Conversation

@gabewillen (Contributor)

Summary

  • implement tokenizer bind flow, pretokenized BPE path, and encoder sm_any alignment (bind flow sketched after this list)
  • add tokenizer BPE split implementation and expanded tokenizer/encoder tests
  • add full tokenizer pipeline benchmarks and update snapshots/docs
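
A minimal sketch of the bind-then-tokenize flow. The emel::tokenizer names below follow the files touched in this PR (sm.hpp, events.hpp), but the exact signatures, the tokenize event shape, and the vocab type are illustrative assumptions, not the PR's actual API:

#include <cstddef>
#include <cstdint>
#include <vector>

#include "emel/tokenizer/events.hpp"  // event::bind (added in this PR)
#include "emel/tokenizer/sm.hpp"      // tokenizer state machine

std::size_t tokenize_once(emel::tokenizer::sm & machine,
                          const emel::vocab & vocab) {
    // Binding is now an explicit phase: the SM guard rejects any
    // tokenize event that arrives before a successful bind.
    machine.process_event(emel::tokenizer::event::bind{&vocab});

    // Once bound, tokenization runs the full preprocessor + BPE path.
    std::vector<std::int32_t> ids;
    machine.process_event(emel::tokenizer::event::tokenize{"hello world", &ids});
    return ids.size();
}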

Testing

  • ./scripts/quality_gates.sh

Copilot AI review requested due to automatic review settings on February 22, 2026 at 06:58

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ebac1cff58

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


Copilot AI left a comment


Pull request overview

This PR completes a “full tokenizer pipeline” flow by introducing an explicit tokenizer bind phase, integrating a new BPE split/byte-encode path, and adding end-to-end tokenizer benchmarks plus expanded unit tests and docs/snapshots.

Changes:

  • Add tokenizer::event::bind and update tokenizer SM to require binding before tokenization.
  • Implement BPE split/byte-encoding utility (bpe/split.hpp) and route both preprocessor and BPE encoder through it, including a pretokenized encoder path (see the sketch after this list).
  • Add full tokenizer pipeline benchmarks and new tokenizer/BPE-focused tests; refresh benchmark/doc snapshots.
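
For orientation, a sketch of how the preprocessor might call the new helper. split_scratch and split_and_encode_append are named in the per-file summary below, but this exact signature is an assumption:

#include <string>
#include <string_view>
#include <vector>

#include "emel/tokenizer/bpe/split.hpp"  // split_scratch, split_and_encode_append

// Partition raw text with the BPE regex presets, byte-encode each piece
// into the byte-to-unicode alphabet, and append the resulting fragments.
// The scratch object is reused across calls to avoid reallocation.
void pretokenize_fragment(std::string_view text,
                          emel::tokenizer::bpe::split_scratch & scratch,
                          std::vector<std::string> & out_fragments) {
    emel::tokenizer::bpe::split_and_encode_append(text, scratch, out_fragments);
}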

Reviewed changes

Copilot reviewed 34 out of 34 changed files in this pull request and generated 3 comments.

Summary per file:

  • tools/bench/tokenizer_preprocessor_bpe_bench.cpp: Switch reference BPE splitting to the new bpe::split helper.
  • tools/bench/tokenizer_bench.cpp: Add end-to-end tokenizer benchmark cases across tokenizer models.
  • tools/bench/bench_main.cpp: Register new tokenizer benchmark cases in the bench runner.
  • tools/bench/bench_cases.hpp: Expose new tokenizer benchmark case appenders.
  • tools/bench/CMakeLists.txt: Build tokenizer_bench.cpp into bench_runner.
  • tests/tokenizer/tokenizer_tests.cpp: New tests for bind+tokenize behavior and error cases.
  • tests/tokenizer/tokenizer_action_guard_tests.cpp: New tests for tokenizer guard/action error paths and mappings.
  • tests/tokenizer/bpe_split_tests.cpp: New tests for BPE split/encode helper behavior and edge cases.
  • tests/tokenizer/bpe_regex_tests.cpp: New tests validating regex preset mapping via regex_for.
  • tests/encoder/encoder_tests.cpp: Remove BPE-regex assignment tests now that the regex assignment API changed.
  • src/emel/tokenizer/sm.hpp: Redesign tokenizer SM states; add explicit bind flow and process_event(bind).
  • src/emel/tokenizer/preprocessor/context.hpp: Replace cached regex/word vectors with split_scratch.
  • src/emel/tokenizer/preprocessor/actions.hpp: Route BPE partitioning through split_and_encode_append.
  • src/emel/tokenizer/guards.hpp: Require a bound-vocab match for tokenization; add bind guard; remove special-cache guards.
  • src/emel/tokenizer/events.hpp: Add event::bind and bind done/error events.
  • src/emel/tokenizer/context.hpp: Add preprocessor "any" machine, bound state, and preprocess-kind tracking.
  • src/emel/tokenizer/bpe/split.hpp: New BPE split + byte-to-unicode encoding implementation with scratch storage.
  • src/emel/tokenizer/bpe/regex.hpp: Replace the mutable "assign regex" API with regex_for returning a regex_list.
  • src/emel/tokenizer/actions.hpp: Implement bind actions, preprocessor dispatch, and pretokenized BPE encoder usage.
  • src/emel/encoder/events.hpp: Add event::encode.pretokenized flag (sketched after this list).
  • src/emel/encoder/bpe/detail.hpp: Use the new split helper and implement the pretokenized encode path.
  • src/emel/encoder/bpe/context.hpp: Replace regex caching with split_scratch storage.
  • src/emel/encoder/bpe/actions.hpp: Reset BPE split scratch on encode begin.
  • snapshots/quality_gates/timing.txt: Update timing snapshot due to added/changed work.
  • snapshots/lint/clang_format.txt: Add new source/test files to the clang-format snapshot list.
  • snapshots/bench/benchmarks_compare.txt: Update compare snapshot, including new tokenizer cases.
  • snapshots/bench/benchmarks.txt: Update bench snapshot, including new tokenizer cases.
  • docs/plans/tokenizer.plan.md: Document the explicit bind event and pretokenized BPE fragment support.
  • docs/benchmarks.md: Refresh the benchmark table with new cases/results.
  • docs/architecture/tokenizer_preprocessor_fallback.md: Update architecture docs/diagram to match the new preprocessor behavior.
  • docs/architecture/tokenizer.md: Update tokenizer architecture docs/diagram to match the new bind+pipeline SM.
  • docs/architecture/mermaid/tokenizer_preprocessor_fallback.mmd: Update mermaid diagram for the fallback preprocessor.
  • docs/architecture/mermaid/tokenizer.mmd: Update mermaid diagram for the tokenizer SM.
  • CMakeLists.txt: Add new tokenizer/BPE unit tests to the test target.
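
As referenced in the src/emel/encoder/events.hpp row above, a short sketch of the pretokenized encoder path. The pretokenized flag comes from this PR; the field layout, the encoder SM header, and the surrounding calls are assumptions:

#include <string>
#include <utility>

#include "emel/encoder/events.hpp"  // event::encode with the new pretokenized flag
#include "emel/encoder/sm.hpp"      // encoder state machine (assumed header)

// Fragments already produced by bpe::split can skip the encoder's own
// regex split by flagging the encode event as pretokenized.
void encode_pretokenized(emel::encoder::sm & encoder, std::string fragment) {
    emel::encoder::event::encode ev;
    ev.text         = std::move(fragment);  // pre-split upstream by bpe::split
    ev.pretokenized = true;                 // bypass the regex split in the BPE encoder
    encoder.process_event(ev);
}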


@gabewillen merged commit 41f2fd9 into main on Feb 22, 2026
@gabewillen deleted the feature/tokenizer-full branch on February 22, 2026 at 16:34

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1c9819f820



void append_reference_tokenizer_cases(std::vector<result> & results, const config & cfg) {
    // Reference tokenizer benchmarks reuse the EMEL pipeline until llama.cpp parity is wired.
    append_emel_tokenizer_cases(results, cfg);
}


P1: Benchmark tokenizer reference path against llama.cpp

append_reference_tokenizer_cases currently reuses append_emel_tokenizer_cases, so bench_runner --mode=compare reports tokenizer/full_* ratios where both sides are EMEL while bench_main.cpp labels the second side as llama.cpp. This produces misleading benchmark snapshots/docs and can hide real tokenizer regressions because the supposed reference baseline is not actually exercised for these cases.
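
One possible direction, not taken from the PR itself: emit an explicitly skipped result instead of silently reusing the EMEL cases, so compare mode cannot report a fake llama.cpp baseline. The skipped field on result is hypothetical:

void append_reference_tokenizer_cases(std::vector<result> & results, const config & cfg) {
    (void)cfg;  // unused until llama.cpp parity is wired
    result r;
    r.name    = "tokenizer/full_reference";
    r.skipped = true;  // surfaces as "no baseline" in compare mode instead of a misleading ratio
    results.push_back(r);
}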

