Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ebac1cff58
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
This PR completes a “full tokenizer pipeline” flow by introducing an explicit tokenizer bind phase, integrating a new BPE split/byte-encode path, and adding end-to-end tokenizer benchmarks plus expanded unit tests and docs/snapshots.
Changes:
- Add
tokenizer::event::bindand update tokenizer SM to require binding before tokenization. - Implement BPE split/byte-encoding utility (
bpe/split.hpp) and route both preprocessor and BPE encoder through it (including apretokenizedencoder path). - Add full tokenizer pipeline benchmarks and new tokenizer/BPE-focused tests; refresh benchmark/doc snapshots.
Reviewed changes
Copilot reviewed 34 out of 34 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/bench/tokenizer_preprocessor_bpe_bench.cpp | Switch reference BPE splitting to new bpe::split helper. |
| tools/bench/tokenizer_bench.cpp | Add end-to-end tokenizer benchmark cases across tokenizer models. |
| tools/bench/bench_main.cpp | Register new tokenizer benchmark cases in bench runner. |
| tools/bench/bench_cases.hpp | Expose new tokenizer benchmark case appenders. |
| tools/bench/CMakeLists.txt | Build tokenizer_bench.cpp into bench_runner. |
| tests/tokenizer/tokenizer_tests.cpp | New tests for bind+tokenize behavior and error cases. |
| tests/tokenizer/tokenizer_action_guard_tests.cpp | New tests for tokenizer guards/actions error paths and mappings. |
| tests/tokenizer/bpe_split_tests.cpp | New tests for BPE split/encode helper behavior and edge cases. |
| tests/tokenizer/bpe_regex_tests.cpp | New tests validating regex preset mapping via regex_for. |
| tests/encoder/encoder_tests.cpp | Remove BPE-regex assignment tests now that regex assignment API changed. |
| src/emel/tokenizer/sm.hpp | Redesign tokenizer SM states; add explicit bind flow and process_event(bind). |
| src/emel/tokenizer/preprocessor/context.hpp | Replace cached regex/word vectors with split_scratch. |
| src/emel/tokenizer/preprocessor/actions.hpp | Route BPE partitioning through split_and_encode_append. |
| src/emel/tokenizer/guards.hpp | Require bound-vocab match for tokenization; add bind guard; remove special-cache guards. |
| src/emel/tokenizer/events.hpp | Add event::bind and bind done/error events. |
| src/emel/tokenizer/context.hpp | Add preprocessor “any” machine + bound state and preprocess kind tracking. |
| src/emel/tokenizer/bpe/split.hpp | New BPE split + byte-to-unicode encoding implementation with scratch storage. |
| src/emel/tokenizer/bpe/regex.hpp | Replace mutable “assign regex” API with regex_for returning regex_list. |
| src/emel/tokenizer/actions.hpp | Implement bind actions, preprocessor dispatch, and pretokenized BPE encoder usage. |
| src/emel/encoder/events.hpp | Add event::encode.pretokenized flag. |
| src/emel/encoder/bpe/detail.hpp | Use new split helper and implement pretokenized encode path. |
| src/emel/encoder/bpe/context.hpp | Replace regex caching with split_scratch storage. |
| src/emel/encoder/bpe/actions.hpp | Reset BPE split scratch on encode begin. |
| snapshots/quality_gates/timing.txt | Update timing snapshot due to added/changed work. |
| snapshots/lint/clang_format.txt | Add new source/test files to clang-format snapshot list. |
| snapshots/bench/benchmarks_compare.txt | Update compare snapshot including new tokenizer cases. |
| snapshots/bench/benchmarks.txt | Update bench snapshot including new tokenizer cases. |
| docs/plans/tokenizer.plan.md | Document explicit bind event and pretokenized BPE fragment support. |
| docs/benchmarks.md | Refresh benchmark table with new cases/results. |
| docs/architecture/tokenizer_preprocessor_fallback.md | Update architecture diagram/docs to match new preprocessor behavior. |
| docs/architecture/tokenizer.md | Update tokenizer architecture diagram/docs to match new bind+pipeline SM. |
| docs/architecture/mermaid/tokenizer_preprocessor_fallback.mmd | Update mermaid diagram for fallback preprocessor. |
| docs/architecture/mermaid/tokenizer.mmd | Update mermaid diagram for tokenizer SM. |
| CMakeLists.txt | Add new tokenizer/BPE unit tests to the test target. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1c9819f820
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| void append_reference_tokenizer_cases(std::vector<result> & results, const config & cfg) { | ||
| // Reference tokenizer benchmarks reuse the EMEL pipeline until llama.cpp parity is wired. | ||
| append_emel_tokenizer_cases(results, cfg); |
There was a problem hiding this comment.
Benchmark tokenizer reference path against llama.cpp
append_reference_tokenizer_cases currently reuses append_emel_tokenizer_cases, so bench_runner --mode=compare reports tokenizer/full_* ratios where both sides are EMEL while bench_main.cpp labels the second side as llama.cpp. This produces misleading benchmark snapshots/docs and can hide real tokenizer regressions because the supposed reference baseline is not actually exercised for these cases.
Useful? React with 👍 / 👎.
Summary
Testing