Future: Integrate RE2 as preferred regex engine (with Boost and std::regex fallbacks) #14

@BrunoO

Goal

Add RE2 as the preferred regex engine when available, with fallbacks:

  1. RE2 (when available and pattern-compatible)
  2. Boost.Regex (when available)
  3. std::regex (final fallback)

Expected benefits of RE2

  • Performance – Typically 5–10× faster than std::regex for complex patterns and long text (DFA/NFA hybrid, no backtracking).
  • Predictable complexity – Linear time O(n) in input size; no exponential blow-up on pathological patterns.
  • ReDoS mitigation – Avoids catastrophic backtracking, reducing risk of regex denial-of-service from user or API-supplied patterns.
  • Thread safety – RE2 objects are safe to share across threads; no extra locking in the regex layer.
  • Bounded memory – Configurable limits; no unbounded memory growth during matching.
  • Production use – Widely used in production; BSD-3-Clause license.

Patterns that need lookahead/lookbehind or backreferences will continue to use Boost or std::regex via the fallback path.
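
The compatibility check behind this fallback could look roughly like the sketch below. It is only an illustration: `NeedsBacktrackingEngine` is a hypothetical name that does not come from the issue, and the scan covers just the constructs listed above (lookahead, lookbehind, numbered backreferences). It is deliberately conservative, since a false positive only routes the pattern to Boost or std::regex.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (name and signature are assumptions, not project code).
// Returns true when the pattern appears to use lookahead/lookbehind or a
// numbered backreference, i.e. features RE2 does not support.
bool NeedsBacktrackingEngine(std::string_view pattern) {
  bool in_class = false;  // true while scanning inside a [...] character class
  for (std::size_t i = 0; i < pattern.size(); ++i) {
    const char c = pattern[i];
    if (c == '\\') {
      if (i + 1 < pattern.size()) {
        const char next = pattern[i + 1];
        // \1..\9 outside a character class is a backreference.
        if (!in_class && next >= '1' && next <= '9') return true;
        ++i;  // skip the escaped character so "\(" is not read as a group
      }
      continue;
    }
    if (in_class) {
      if (c == ']') in_class = false;
      continue;
    }
    if (c == '[') {
      in_class = true;
      // A ']' directly after '[' or '[^' is treated as a literal.
      if (i + 1 < pattern.size() && pattern[i + 1] == '^') ++i;
      if (i + 1 < pattern.size() && pattern[i + 1] == ']') ++i;
      continue;
    }
    if (c == '(' && i + 2 < pattern.size() && pattern[i + 1] == '?') {
      const char d = pattern[i + 2];
      if (d == '=' || d == '!') return true;  // (?=...) or (?!...)
      if (d == '<' && i + 3 < pattern.size() &&
          (pattern[i + 3] == '=' || pattern[i + 3] == '!')) {
        return true;  // (?<=...) or (?<!...); "(?<name>" is left alone
      }
    }
  }
  return false;
}
```

A caller would try RE2 only when this returns false, and would still treat a failed RE2 compile as a reason to fall back, because the scan is not a full parser.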

Approach (Option B – package managers)

  • macOS: Homebrew (brew install re2)
  • Linux: vcpkg (vcpkg install re2) alongside existing Boost
  • Windows: initially no RE2; optional vcpkg later for parity

Implementation order

Benchmark strengthening (before RE2)

  1. Regex microbenchmark – New regex_benchmark executable to measure compile + match time for literal, simple, and complex patterns on filename/content corpora. Establish baseline with current std/Boost implementation.
  2. SearchBenchmark extensions – Add --regex-engine=auto|re2|boost|std and regex-heavy reference configs to measure end-to-end impact.
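
As a rough idea of what the baseline measurement in step 1 might involve, the sketch below times compilation and matching separately for std::regex only. `RegexTiming` and `TimeStdRegex` are hypothetical names, and the corpus is just a vector of strings; the actual regex_benchmark design is not specified in this issue.

```cpp
#include <chrono>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical result type for one pattern (an assumption for illustration).
struct RegexTiming {
  std::chrono::nanoseconds compile{0};  // time to construct the regex
  std::chrono::nanoseconds match{0};    // total time spent searching
  std::size_t hits = 0;                 // matches summed over all repetitions
};

// Times std::regex compile and search over `corpus`, repeated `repetitions`
// times so that short corpora still produce a measurable duration.
RegexTiming TimeStdRegex(const std::string& pattern,
                         const std::vector<std::string>& corpus,
                         int repetitions) {
  using Clock = std::chrono::steady_clock;
  RegexTiming t;

  const auto c0 = Clock::now();
  const std::regex re(pattern, std::regex::ECMAScript | std::regex::optimize);
  const auto c1 = Clock::now();
  t.compile = std::chrono::duration_cast<std::chrono::nanoseconds>(c1 - c0);

  const auto m0 = Clock::now();
  for (int r = 0; r < repetitions; ++r) {
    for (const std::string& line : corpus) {
      if (std::regex_search(line, re)) ++t.hits;
    }
  }
  const auto m1 = Clock::now();
  t.match = std::chrono::duration_cast<std::chrono::nanoseconds>(m1 - m0);
  return t;
}
```

Returning the hit count alongside the timings keeps the compiler from discarding the search loop and gives a cheap sanity check that different engines agree on the same corpus.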

RE2 integration

  1. Build plumbing – CMake detection (find_package(re2 CONFIG QUIET)), HAVE_RE2 / RE2_REGEX_AVAILABLE, CI steps to install RE2 on macOS and Linux. No runtime behavior change.
  2. Unified regex wrapper – Single API with engine priority (RE2 → Boost → std), pattern checks for RE2-unsupported features (lookahead/lookbehind, backrefs), caching.
  3. Tests & benchmarks – Unit tests for selection/fallback; re-run regex_benchmark and search_benchmark to quantify improvement.
  4. Windows RE2 (optional) – vcpkg on Windows, PGO rules for RE2 target.
  5. Documentation – Engine order, CMake options, pattern limitations, how to run benchmarks.
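
To make step 2 more concrete, here is a minimal sketch of a wrapper with engine priority and caching. Several things are assumptions: `HAVE_RE2` is treated as a preprocessor definition set by the build, `RegexCache` and `LooksRe2Compatible` are hypothetical names, the Boost tier is omitted for brevity, and differences between RE2 and ECMAScript syntax are ignored. Without `HAVE_RE2` the code compiles and always uses std::regex.

```cpp
#include <cstddef>
#include <initializer_list>
#include <memory>
#include <mutex>
#include <regex>
#include <string>
#include <unordered_map>

#if defined(HAVE_RE2)  // assumed to be defined by the build when RE2 is found
#include <re2/re2.h>
#endif

enum class Engine { kRe2, kStd };

// Crude stand-in for the pattern checks: rejects lookaround and \1..\9.
inline bool LooksRe2Compatible(const std::string& p) {
  for (const char* tok : {"(?=", "(?!", "(?<=", "(?<!"}) {
    if (p.find(tok) != std::string::npos) return false;
  }
  for (std::size_t i = 0; i + 1 < p.size(); ++i) {
    if (p[i] != '\\') continue;
    if (p[i + 1] >= '1' && p[i + 1] <= '9') return false;
    ++i;  // skip the escaped character
  }
  return true;
}

class RegexCache {
 public:
  // Returns true if `text` contains a match for `pattern`.
  bool Search(const std::string& pattern, const std::string& text) {
    return Get(pattern)->Search(text);
  }
  std::size_t Size() {
    std::lock_guard<std::mutex> lock(mu_);
    return cache_.size();
  }

 private:
  struct Compiled {
    Engine engine = Engine::kStd;
#if defined(HAVE_RE2)
    std::unique_ptr<re2::RE2> re2;
#endif
    std::regex std_re;
    bool Search(const std::string& text) const {
#if defined(HAVE_RE2)
      if (engine == Engine::kRe2) return re2::RE2::PartialMatch(text, *re2);
#endif
      return std::regex_search(text, std_re);
    }
  };

  // Compiles each distinct pattern once; later calls reuse the cached object.
  std::shared_ptr<const Compiled> Get(const std::string& pattern) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = cache_.find(pattern);
    if (it != cache_.end()) return it->second;
    auto c = std::make_shared<Compiled>();
#if defined(HAVE_RE2)
    if (LooksRe2Compatible(pattern)) {
      c->re2 = std::make_unique<re2::RE2>(pattern, re2::RE2::Quiet);
      if (c->re2->ok()) c->engine = Engine::kRe2;  // otherwise fall back
    }
#endif
    if (c->engine == Engine::kStd) c->std_re = std::regex(pattern);
    cache_.emplace(pattern, c);
    return c;
  }

  std::mutex mu_;
  std::unordered_map<std::string, std::shared_ptr<const Compiled>> cache_;
};
```

Holding the lock during compilation keeps the sketch short; a real implementation might compile outside the lock or bound the cache size.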

References

  • RE2 feasibility and pattern fallbacks: internal-docs/archive/RE2_FEASIBILITY_STUDY.md (in main repo)
  • Implementation phases and benchmark plan: internal-docs/plans/2026-03-15_RE2_AND_BENCHMARK_PHASES.md (in main repo)

This issue is a placeholder for tracking the above work; no code changes required until implementation starts.

Labels: enhancement
