
tokenizer: full pipeline + benchmarks #4

Merged

gabewillen merged 9 commits into main from feature/tokenizer-full on Feb 22, 2026
Conversation

@gabewillen (Contributor)

Summary

  • implement tokenizer bind flow, pretokenized BPE path, and encoder sm_any alignment (bind flow sketched after this list)
  • add tokenizer BPE split implementation and expanded tokenizer/encoder tests
  • add full tokenizer pipeline benchmarks and update snapshots/docs
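
A minimal sketch of the bind-then-tokenize flow. The emel::tokenizer names below follow the files touched in this PR (sm.hpp, events.hpp), but the exact signatures, the tokenize event shape, and the vocab type are illustrative assumptions, not the PR's actual API:

#include <cstddef>
#include <cstdint>
#include <vector>

#include "emel/tokenizer/events.hpp"  // event::bind (added in this PR)
#include "emel/tokenizer/sm.hpp"      // tokenizer state machine

std::size_t tokenize_once(emel::tokenizer::sm & machine,
                          const emel::vocab & vocab) {
    // Binding is now an explicit phase: the SM guard rejects any
    // tokenize event that arrives before a successful bind.
    machine.process_event(emel::tokenizer::event::bind{&vocab});

    // Once bound, tokenization runs the full preprocessor + BPE path.
    std::vector<std::int32_t> ids;
    machine.process_event(emel::tokenizer::event::tokenize{"hello world", &ids});
    return ids.size();
}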

Testing

  • ./scripts/quality_gates.sh

Copilot AI review requested due to automatic review settings on February 22, 2026 at 06:58

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ebac1cff58

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


Copilot AI left a comment


Pull request overview

This PR completes a “full tokenizer pipeline” flow by introducing an explicit tokenizer bind phase, integrating a new BPE split/byte-encode path, and adding end-to-end tokenizer benchmarks plus expanded unit tests and docs/snapshots.

Changes:

  • Add tokenizer::event::bind and update tokenizer SM to require binding before tokenization.
  • Implement BPE split/byte-encoding utility (bpe/split.hpp) and route both preprocessor and BPE encoder through it, including a pretokenized encoder path (see the sketch after this list).
  • Add full tokenizer pipeline benchmarks and new tokenizer/BPE-focused tests; refresh benchmark/doc snapshots.
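
For orientation, a sketch of how the preprocessor might call the new helper. split_scratch and split_and_encode_append are named in the per-file summary below, but this exact signature is an assumption:

#include <string>
#include <string_view>
#include <vector>

#include "emel/tokenizer/bpe/split.hpp"  // split_scratch, split_and_encode_append

// Partition raw text with the BPE regex presets, byte-encode each piece
// into the byte-to-unicode alphabet, and append the resulting fragments.
// The scratch object is reused across calls to avoid reallocation.
void pretokenize_fragment(std::string_view text,
                          emel::tokenizer::bpe::split_scratch & scratch,
                          std::vector<std::string> & out_fragments) {
    emel::tokenizer::bpe::split_and_encode_append(text, scratch, out_fragments);
}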

Reviewed changes

Copilot reviewed 34 out of 34 changed files in this pull request and generated 3 comments.

Summary per file:

  • tools/bench/tokenizer_preprocessor_bpe_bench.cpp: Switch reference BPE splitting to the new bpe::split helper.
  • tools/bench/tokenizer_bench.cpp: Add end-to-end tokenizer benchmark cases across tokenizer models.
  • tools/bench/bench_main.cpp: Register new tokenizer benchmark cases in the bench runner.
  • tools/bench/bench_cases.hpp: Expose new tokenizer benchmark case appenders.
  • tools/bench/CMakeLists.txt: Build tokenizer_bench.cpp into bench_runner.
  • tests/tokenizer/tokenizer_tests.cpp: New tests for bind+tokenize behavior and error cases.
  • tests/tokenizer/tokenizer_action_guard_tests.cpp: New tests for tokenizer guard/action error paths and mappings.
  • tests/tokenizer/bpe_split_tests.cpp: New tests for BPE split/encode helper behavior and edge cases.
  • tests/tokenizer/bpe_regex_tests.cpp: New tests validating regex preset mapping via regex_for.
  • tests/encoder/encoder_tests.cpp: Remove BPE-regex assignment tests now that the regex assignment API changed.
  • src/emel/tokenizer/sm.hpp: Redesign tokenizer SM states; add explicit bind flow and process_event(bind).
  • src/emel/tokenizer/preprocessor/context.hpp: Replace cached regex/word vectors with split_scratch.
  • src/emel/tokenizer/preprocessor/actions.hpp: Route BPE partitioning through split_and_encode_append.
  • src/emel/tokenizer/guards.hpp: Require a bound-vocab match for tokenization; add bind guard; remove special-cache guards.
  • src/emel/tokenizer/events.hpp: Add event::bind and bind done/error events.
  • src/emel/tokenizer/context.hpp: Add preprocessor "any" machine, bound state, and preprocess-kind tracking.
  • src/emel/tokenizer/bpe/split.hpp: New BPE split + byte-to-unicode encoding implementation with scratch storage.
  • src/emel/tokenizer/bpe/regex.hpp: Replace the mutable "assign regex" API with regex_for returning a regex_list.
  • src/emel/tokenizer/actions.hpp: Implement bind actions, preprocessor dispatch, and pretokenized BPE encoder usage.
  • src/emel/encoder/events.hpp: Add event::encode.pretokenized flag (sketched after this list).
  • src/emel/encoder/bpe/detail.hpp: Use the new split helper and implement the pretokenized encode path.
  • src/emel/encoder/bpe/context.hpp: Replace regex caching with split_scratch storage.
  • src/emel/encoder/bpe/actions.hpp: Reset BPE split scratch on encode begin.
  • snapshots/quality_gates/timing.txt: Update timing snapshot due to added/changed work.
  • snapshots/lint/clang_format.txt: Add new source/test files to the clang-format snapshot list.
  • snapshots/bench/benchmarks_compare.txt: Update compare snapshot, including new tokenizer cases.
  • snapshots/bench/benchmarks.txt: Update bench snapshot, including new tokenizer cases.
  • docs/plans/tokenizer.plan.md: Document the explicit bind event and pretokenized BPE fragment support.
  • docs/benchmarks.md: Refresh the benchmark table with new cases/results.
  • docs/architecture/tokenizer_preprocessor_fallback.md: Update architecture docs/diagram to match the new preprocessor behavior.
  • docs/architecture/tokenizer.md: Update tokenizer architecture docs/diagram to match the new bind+pipeline SM.
  • docs/architecture/mermaid/tokenizer_preprocessor_fallback.mmd: Update mermaid diagram for the fallback preprocessor.
  • docs/architecture/mermaid/tokenizer.mmd: Update mermaid diagram for the tokenizer SM.
  • CMakeLists.txt: Add new tokenizer/BPE unit tests to the test target.
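
As referenced in the src/emel/encoder/events.hpp row above, a short sketch of the pretokenized encoder path. The pretokenized flag comes from this PR; the field layout, the encoder SM header, and the surrounding calls are assumptions:

#include <string>
#include <utility>

#include "emel/encoder/events.hpp"  // event::encode with the new pretokenized flag
#include "emel/encoder/sm.hpp"      // encoder state machine (assumed header)

// Fragments already produced by bpe::split can skip the encoder's own
// regex split by flagging the encode event as pretokenized.
void encode_pretokenized(emel::encoder::sm & encoder, std::string fragment) {
    emel::encoder::event::encode ev;
    ev.text         = std::move(fragment);  // pre-split upstream by bpe::split
    ev.pretokenized = true;                 // bypass the regex split in the BPE encoder
    encoder.process_event(ev);
}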


@gabewillen merged commit 41f2fd9 into main on Feb 22, 2026
@gabewillen deleted the feature/tokenizer-full branch on February 22, 2026 at 16:34

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1c9819f820



void append_reference_tokenizer_cases(std::vector<result> & results, const config & cfg) {
    // Reference tokenizer benchmarks reuse the EMEL pipeline until llama.cpp parity is wired.
    append_emel_tokenizer_cases(results, cfg);
}


P1: Benchmark tokenizer reference path against llama.cpp

append_reference_tokenizer_cases currently reuses append_emel_tokenizer_cases, so bench_runner --mode=compare reports tokenizer/full_* ratios where both sides are EMEL while bench_main.cpp labels the second side as llama.cpp. This produces misleading benchmark snapshots/docs and can hide real tokenizer regressions because the supposed reference baseline is not actually exercised for these cases.
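
One possible direction, not taken from the PR itself: emit an explicitly skipped result instead of silently reusing the EMEL cases, so compare mode cannot report a fake llama.cpp baseline. The skipped field on result is hypothetical:

void append_reference_tokenizer_cases(std::vector<result> & results, const config & cfg) {
    (void)cfg;  // unused until llama.cpp parity is wired
    result r;
    r.name    = "tokenizer/full_reference";
    r.skipped = true;  // surfaces as "no baseline" in compare mode instead of a misleading ratio
    results.push_back(r);
}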

