Skip to content

ieviev/resharp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RE#

crates.io docs.rs

A high-performance, automata-based regex engine with first-class support for intersection and complement operations.

RE# compiles patterns into deterministic automata. All matching is non-backtracking with guaranteed linear-time execution. RE# extends standard regex syntax with intersection (&), complement (~), and a universal wildcard (_), enabling patterns that are impossible or impractical to express with standard regex.

paper | blog post | syntax docs | dotnet version and web playground

Usage

let re = resharp::Regex::new(r".*cat.*&.*dog.*&.{8,15}").unwrap();

let matches = re.find_all(b"the cat and the dog");
let found = re.is_match(b"the cat and the dog");

Syntax extensions

RE# supports standard regex syntax plus three extensions: _ (universal wildcard), & (intersection), and ~(...) (complement). _ matches any character including newlines, so _* means "any string".

_*              any string
a_*             any string that starts with 'a'
_*a             any string that ends with 'a'
_*a_*           any string that contains 'a'
~(_*a_*)        any string that does NOT contain 'a'
(_*a_*)&~(_*b_*)  contains 'a' AND does not contain 'b'
(?<=b)_*&_*(?=a)  preceded by 'b' AND followed by 'a'

You combine all of these with & to get more complex patterns. RE# also supports lookarounds ((?=...), (?<=...), (?!...), (?<!...)), compiled directly into the automaton with no backtracking. See the full syntax reference for details.

Differences from resharp-dotnet RE# and rust regex

Rust resharp is written from scratch, there are a number of differences from the original. For starters this works on byte slices (&[u8]) and UTF-8 rather than UTF-16. The parser uses the regex-syntax crate as a base with 3 extensions described above. The API is also different, there's different internal representation for the characters and algebra.. etc

While this version is more mature, given it's the second fifth reimplementation, I haven't had that much time to optimize the match loop for specific patterns or use cases. It will still be fast given it's a deterministic automata based engine, but it will be closer to 1GB/s rather than 15GB/s that you can get with SIMD literal optimizations and other pattern-specific optimizations. We do use memrchr for string literals, so literals will actually do 10+GB/s, but don't expect that for more complex patterns.

I have plans to add these optimizations, but it will take some time and effort to properly implement them. If you want to help me out with that I would be grateful, I need some SIMD subroutines that would work across x86-64 and arm64, bidirectionally, and ideally also support no_std environments.. perhaps one more constraint that i want them to be reasonably concise and correct, if I wanted LLM-generated code I would just ask in the first place

In its current state, when you should use this as opposed to the default regex in rust is:

  • you want to match patterns that require intersection or complement or lookarounds
  • you want to match large regexes with high performance, at the expense of memory usage
  • you want leftmost longest matches rather than leftmost first matches

For tuning, EngineOptions controls precompilation threshold, capacity, and lookahead context:

let opts = resharp::EngineOptions {
    dfa_threshold: 100,           // eagerly compile up to N states
    max_dfa_capacity: 65535,       // max automata states (default: u16::MAX)
    lookahead_context_max: 800,    // max lookahead context distance (default: 800)
};
let re = resharp::Regex::with_options(r"pattern", opts).unwrap();

RE# matching API is slightly different from regex, matches will return a Result<Vec<Match>, Error>, where the Error can be a capacity overflow or a lookahead context overflow. RE# will either give you fast matching or fail outright. You can catch these errors and rebuild / adjust your pattern or options accordingly.

Benchmarks

Throughput comparison with regex and fancy-regex on an AMD Ryzen 7 5800X, compiled with --release. Compile time is excluded; only matching is measured. Run with cargo bench -- 'readme/'.

Benchmark resharp regex fancy-regex
dictionary 2663 words (900KB, ~15 matches) 503 MiB/s 547 MiB/s 542 MiB/s
dictionary 2663 words (944KB, ~2678 matches) 424 MiB/s 57 MiB/s 20 MiB/s
dictionary (?i) 2663 words (900KB) 505 MiB/s 0.03 MiB/s 0.03 MiB/s
lookaround (?<=\s)[A-Z][a-z]+(?=\s) (900KB) 265 MiB/s -- 25 MiB/s
literal alternation (900KB) 490 MiB/s 11.3 GiB/s 10.0 GiB/s
literal "Sherlock Holmes" (900KB) 13.8 GiB/s 39.0 GiB/s 32.7 GiB/s

Notes on the results:

  • The first dictionary row is roughly tied - the prose haystack only contains ~15 matches, so the lazy DFA barely explores any states. RE#'s advantage is that its full DFA is smaller, but this isn't visible when most states are never materialized. On longer inputs or denser matches, the other engines will degrade - take lazy-dfa benchmarks with a grain of salt, you will not be matching the exact same string over and over in the real world.
  • The (?i) row shows what happens when the pattern forces regex to fall back from its DFA to an NFA: throughput drops from 547 MiB/s to 0.03 MiB/s. RE# handles case folding in the DFA and maintains full speed. You can increase regex's DFA threshold to avoid this fallback, but only up to a point.
  • RE# compiles lookarounds directly into the automaton - no back-and-forth between forward and backward passes. regex doesn't support lookarounds except for anchors; fancy-regex handles them via backtracking, which is occasionally much slower.
  • Literal patterns and <20 literal alternations are a clear RE# weakness - regex uses Teddy for these, which RE# does not have. On pure literal matching, regex / fancy-regex is 20-80x faster.

Crate structure

Crate Description
resharp engine and public API (resharp-engine)
resharp-algebra algebraic regex tree, constraint solver, nullability analysis
resharp-parser pattern string to AST, extends regex-syntax with RE# operators

And most importantly, have fun! :)

About

RE# - A high-performance, automata based regex engine with first-class support for intersection and complement operations.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages