Release NumSharp 0.41.0-prerelease · SciSharp/NumSharp

This prerelease introduces the IL Kernel Generator, a complete architectural overhaul that replaces ~600K lines of Regen-generated template code with ~19K lines of runtime IL generation.
It delivers large performance improvements (MatMul alone is 35-100x faster), comprehensive NumPy 2.x alignment, and significantly cleaner, more maintainable code.

Installation

dotnet add package NumSharp --version 0.41.0-prerelease

Or via Package Manager:

Install-Package NumSharp -Version 0.41.0-prerelease

TL;DR

IL Kernel Generator: Runtime IL emission replaces 600K lines of Regen templates with 19K lines
SIMD everywhere: Vector128/256/512 with runtime detection across all operations
35 new functions: nansum/prod/min/max/mean/var/std, cbrt, floor_divide, left/right_shift, deg2rad, rad2deg, cumprod, count_nonzero, isnan, isfinite, isinf, isclose, invert, reciprocal, square, trunc, plus comparison and logical modules
Operators fixed: ==, !=, <, >, <=, >=, &, |, ^
np.comparison module: np.equal(), np.not_equal(), np.less(), np.greater(), np.less_equal(), np.greater_equal()
np.logical module: np.logical_and(), np.logical_or(), np.logical_not(), np.logical_xor()
NDArray<T> operators: Typed &, |, ^ for generic arrays (resolves NDArray<bool> ambiguity)
Math functions rewritten: sin, cos, tan, exp, log, sqrt, abs, sign, floor, ceil, etc.
60+ bug fixes: np.negative, np.positive, np.unique, np.dot, np.matmul, np.abs, np.argmax/min, np.mean, np.std/var, np.cumsum, np.nonzero, np.all/any, np.clip, and more
MatMul 35-100x faster: Cache-blocked SIMD achieving 20+ GFLOPS
Boolean indexing rewrite: SIMD fast path with CountTrue/CopyMasked
Axis reductions rewrite: AVX2 gather, NaN-aware, proper keepdims and empty array handling
Single-threaded execution: Deterministic and non-blocking; all use of Parallel.* has been removed, with SIMD compensating for the lost thread-level parallelism
Architecture cleanup: Broadcasting in Shape struct, TensorEngine routing, static ILKernelGenerator
np.random aligned (#582): Parameter names match NumPy, Shape overloads added
DecimalMath internalized (#588): Removed embedded third-party code
NEP50 compliant: NumPy 2.x type promotion rules
Benchmark infrastructure: SIMD vs scalar comparison suite
DefaultEngine dispatch layer: BinaryOp, BitwiseOp, CompareOp, ReductionOp, UnaryOp
+4,200 unit tests: both our own and tests migrated from Python/NumPy to C#

Section	Highlights
Summary	106 commits, -533K lines, 3,907 tests
IL Kernel Generator	27 files, SIMD V128/256/512
Architecture	Static ILKernelGenerator, TensorEngine routing
New NumPy Functions (35)	nansum, isnan, cumprod, etc.
Critical Bug Fixes	negative, unique, dot, linspace, intp
Operator Rewrites	==, !=, <, >, &, | now work
Boolean Indexing Rewrite	SIMD fast path, 76 battle tests
Slicing Improvements	Broadcast stride=0 preserved
Performance Improvements	MatMul 35-100x, 20+ GFLOPS
Code Reduction	99% binary, 98% MatMul, 97% Dot
Infrastructure Changes	NativeMemory, static kernels
API Alignment	random() params aligned with NumPy
New Test Files (68)	34 kernel, 8 NumPy, 4 linalg, 76 boolean
Known Issues	52 OpenBugs excluded
Installation	`dotnet add package NumSharp`

Summary

Metric	Value
Commits	106
Files Changed	558
Lines Added	+72,635
Lines Deleted	-605,976
Net Change	-533K lines
Test Results	3,907 passed, 52 OpenBugs, 11 skipped

Detailed Breakdown

IL Kernel Generator

Runtime IL generation via System.Reflection.Emit.DynamicMethod replaces static Regen templates.

Kernel Files (27 new files)

ILKernelGenerator.cs - Core infrastructure, SIMD detection (Vector128/256/512)
ILKernelGenerator.Binary.cs - Add, Sub, Mul, Div, BitwiseAnd/Or/Xor
ILKernelGenerator.MixedType.cs - Mixed-type ops with type promotion
ILKernelGenerator.Unary.cs - Negate, Abs, Sqrt, Sin, Cos, Exp, Log, Sign
ILKernelGenerator.Comparison.cs - ==, !=, <, >, <=, >= returning bool arrays
ILKernelGenerator.Reduction.cs - Sum, Prod, Min, Max, Mean, ArgMax, ArgMin, All, Any
ILKernelGenerator.Reduction.Axis.Simd.cs - AVX2 gather for axis reductions
ILKernelGenerator.Scan.cs - CumSum, CumProd with SIMD
ILKernelGenerator.Shift.cs - LeftShift, RightShift
ILKernelGenerator.MatMul.cs - Cache-blocked SIMD matrix multiply
ILKernelGenerator.Clip.cs, .Modf.cs, .Masking.cs - Specialized ops

Execution Paths

SimdFull - Contiguous + SIMD-capable dtype → Vector loop + scalar tail
ScalarFull - Contiguous + non-SIMD dtype (Decimal) → Scalar loop
General - Strided/broadcast → Coordinate-based iteration
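
As a rough illustration of the three paths above, the following pure-Python sketch picks a path from layout and dtype, and shows the "vector loop + scalar tail" shape of a SimdFull kernel. The function names and the chunk width of 4 are hypothetical; the real kernels are emitted as IL and are not shown here.

```python
def choose_path(is_contiguous: bool, dtype_has_simd: bool) -> str:
    """Pick an execution path; path names are taken from the list above."""
    if not is_contiguous:
        return "General"          # strided/broadcast input: coordinate-based iteration
    return "SimdFull" if dtype_has_simd else "ScalarFull"


def add_simd_style(a, b, width=4):
    """Add two equal-length lists in `width`-sized chunks, then a scalar tail."""
    n = len(a)
    out = [0] * n
    vec_end = n - (n % width)     # last index covered by whole chunks
    for i in range(0, vec_end, width):
        # Stand-in for one vector instruction over `width` lanes.
        out[i:i + width] = [x + y for x, y in zip(a[i:i + width], b[i:i + width])]
    for i in range(vec_end, n):
        # Scalar tail: elements that do not fill a whole chunk.
        out[i] = a[i] + b[i]
    return out
```

With six elements and a width of 4, the first loop handles indices 0-3 and the tail handles indices 4-5.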

Infrastructure

KernelKey.cs, KernelOp.cs, KernelSignatures.cs - Kernel dispatch
SimdMatMul.cs - SIMD matrix multiplication helpers
TypeRules.cs - NEP50 type promotion rules

Architecture

Clean separation of concerns:

Component	Design
`ILKernelGenerator`	Static class (27 partial files), internal to `DefaultEngine`
`TensorEngine`	All `np.*` ops route through abstract methods
`Shape.Broadcasting`	Pure shape math in `Shape` struct (456 lines)
`ArgMin/ArgMax`	Unified IL kernel with NaN-aware + Boolean semantics
`DecimalMath`	Internal utility (~403 lines) for Sqrt, Pow, ATan2, Exp, Log

Single-Threaded Execution

All computation is single-threaded with no Parallel.For usage. This provides:

Deterministic behavior - The same inputs always produce the same outputs in the same order
Non-blocking execution - No thread synchronization overhead
Simplified debugging - Stack traces are straightforward
SIMD compensation - Vector128/256/512 intrinsics provide parallelism at the CPU level

Broadcasting External to Engine

Broadcasting logic (Shape.Broadcasting.cs) is pure shape math with no engine dependencies:

Shape.AreBroadcastable() - Check if shapes can broadcast
Shape.Broadcast() - Compute broadcast result shape and strides
Shape.ResolveReturnShape() - Determine output shape for operations
DefaultEngine delegates all broadcasting to Shape.* methods
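
The "pure shape math" involved can be sketched in a few lines. This Python version follows NumPy's standard broadcasting rule (align shapes on the right, treat missing or size-1 dimensions as stretchable). The function names are hypothetical and only loosely mirror Shape.Broadcast() and Shape.AreBroadcastable().

```python
def broadcast_shape(a, b):
    """Return the broadcast result shape of two shapes, or raise ValueError."""
    ndim = max(len(a), len(b))
    # Left-pad the shorter shape with 1s so both have the same rank.
    a = (1,) * (ndim - len(a)) + tuple(a)
    b = (1,) * (ndim - len(b)) + tuple(b)
    out = []
    for da, db in zip(a, b):
        if da == db or db == 1:
            out.append(da)
        elif da == 1:
            out.append(db)
        else:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
    return tuple(out)


def are_broadcastable(a, b):
    """True when broadcast_shape succeeds for the two shapes."""
    try:
        broadcast_shape(a, b)
        return True
    except ValueError:
        return False
```

For example, shapes (3, 1) and (1, 4) broadcast to (3, 4), while (3,) and (4,) do not broadcast.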

DecimalMath (#588)

Replaced embedded third-party DecimalEx.cs (~1061 lines) with minimal internal DecimalMath.cs (~403 lines) containing only the functions NumSharp actually uses: Sqrt, Pow, ATan2, Exp, Log, Log10, ATan.

TensorEngine Abstract Methods

Compare, NotEqual, Less, LessEqual, Greater, GreaterEqual, BitwiseAnd, BitwiseOr, BitwiseXor, LeftShift, RightShift, Power(NDArray, NDArray), FloorDivide, Truncate, Reciprocal, Square, Cbrt, Invert, Deg2Rad, Rad2Deg, IsInf, ReduceCumMul, Any, NanSum, NanProd, NanMin, NanMax, BooleanMask

DefaultEngine Dispatch Files (IL kernel integration)

File	Functions
`DefaultEngine.BinaryOp.cs`	`np.add`, `np.subtract`, `np.multiply`, `np.divide`, `np.mod`, `np.power`
`DefaultEngine.BitwiseOp.cs`	`np.bitwise_and`, `np.bitwise_or`, `np.bitwise_xor`, `&`, `|`, `^`
`DefaultEngine.CompareOp.cs`	`np.equal`, `np.not_equal`, `np.less`, `np.greater`, `np.less_equal`, `np.greater_equal`
`DefaultEngine.ReductionOp.cs`	`np.sum`, `np.prod`, `np.min`, `np.max`, `np.mean`, `np.std`, `np.var`, `np.argmax`, `np.argmin`
`DefaultEngine.UnaryOp.cs`	`np.abs`, `np.negative`, `np.sqrt`, `np.sin`, `np.cos`, `np.exp`, `np.log`, `np.sign`, etc.

Implementation Files

Default.Any.cs, Default.BooleanMask.cs, Default.Reduction.Nan.cs, Shape.Broadcasting.cs

New NumPy Functions (35)

NaN-Aware Reductions (7)

Function	Description
`np.nansum`	Sum ignoring NaN
`np.nanprod`	Product ignoring NaN
`np.nanmin`	Minimum ignoring NaN
`np.nanmax`	Maximum ignoring NaN
`np.nanmean`	Mean ignoring NaN
`np.nanvar`	Variance ignoring NaN
`np.nanstd`	Standard deviation ignoring NaN
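
The semantics of these reductions can be illustrated with a small pure-Python sketch. It mirrors NumPy's reference behavior (an all-NaN input gives 0.0 for `nansum` and NaN for `nanmean`/`nanmax`); it is not NumSharp code, and the helper names simply reuse the NumPy names for readability.

```python
import math


def nansum(values):
    """Sum of the non-NaN values; 0.0 when nothing is left."""
    return sum((v for v in values if not math.isnan(v)), 0.0)


def nanmean(values):
    """Mean of the non-NaN values; NaN when nothing is left."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else math.nan


def nanmax(values):
    """Maximum of the non-NaN values; NaN when nothing is left."""
    kept = [v for v in values if not math.isnan(v)]
    return max(kept) if kept else math.nan
```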

Math Operations (8)

Function	Description
`np.cbrt`	Cube root
`np.floor_divide`	Floor division (rounds toward negative infinity)
`np.reciprocal`	Element-wise 1/x
`np.trunc`	Truncate to integer
`np.invert`	Bitwise NOT
`np.square`	Element-wise square
`np.cumprod`	Cumulative product
`np.count_nonzero`	Count non-zero elements
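
A few of these operations are easy to confuse, so here is a pure-Python sketch of their NumPy reference semantics. Note that `floor_divide` rounds toward negative infinity while `trunc` rounds toward zero. The helpers are illustrative stand-ins operating on scalars and lists, not NumSharp's implementation.

```python
import math
from itertools import accumulate
from operator import mul


def floor_divide(a, b):
    """Floor division: the quotient rounded toward negative infinity."""
    return a // b


def trunc(x):
    """Drop the fractional part (round toward zero), keeping a float result."""
    return float(math.trunc(x))


def cumprod(values):
    """Running product: element i is the product of values[0..i]."""
    return list(accumulate(values, mul))


def count_nonzero(values):
    """Number of elements that are not equal to zero."""
    return sum(1 for v in values if v != 0)
```

For a negative operand the two rounding modes differ: `floor_divide(-7, 2)` gives -4, whereas truncating -3.5 gives -3.0.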

Bitwise & Trigonometric (4)

Function	Description
`np.left_shift`	Bitwise left shift
`np.right_shift`	Bitwise right shift
`np.deg2rad`	Degrees to radians
`np.rad2deg`	Radians to degrees

Logic & Validation (4) - Previously returned `null`

Function	Description
`np.isnan`	Test element-wise for NaN
`np.isfinite`	Test element-wise for finiteness
`np.isinf`	Test element-wise for infinity
`np.isclose`	Element-wise comparison within tolerance
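
For reference, NumPy documents `isclose` as the asymmetric test `|a - b| <= atol + rtol * |b|`, with defaults `rtol=1e-5` and `atol=1e-8`. The scalar sketch below implements that formula in pure Python. It assumes NumSharp follows the same definition, which these notes imply but do not spell out.

```python
import math


def isclose(a, b, rtol=1e-5, atol=1e-8, equal_nan=False):
    """Scalar version of NumPy's documented isclose formula."""
    if math.isnan(a) or math.isnan(b):
        # NaN only compares close to NaN, and only when equal_nan is set.
        return equal_nan and math.isnan(a) and math.isnan(b)
    if math.isinf(a) or math.isinf(b):
        # Infinities are close only to an infinity of the same sign.
        return a == b
    return abs(a - b) <= atol + rtol * abs(b)
```

Because the tolerance scales with `|b|` only, `isclose(a, b)` and `isclose(b, a)` can differ for values near the tolerance boundary.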

Operators (2) - Previously returned `null`

Operator	Description
`operator &`	Bitwise/logical AND with broadcasting
`operator |`	Bitwise/logical OR with broadcasting

Comparison Functions (6) - New named API

Function	Description
`np.equal`	Element-wise equality (wraps `==`)
`np.not_equal`	Element-wise inequality (wraps `!=`)
`np.less`	Element-wise less than (wraps `<`)
`np.greater`	Element-wise greater than (wraps `>`)
`np.less_equal`	Element-wise less or equal (wraps `<=`)
`np.greater_equal`	Element-wise greater or equal (wraps `>=`)

Logical Functions (4) - New named API

Function	Description
`np.logical_and`	Element-wise logical AND
`np.logical_or`	Element-wise logical OR
`np.logical_not`	Element-wise logical NOT
`np.logical_xor`	Element-wise logical XOR

New Overloads

Function	New Capability
`np.power(array, array)`	Array exponents (was scalar only)
`np.repeat(array, NDArray)`	Per-element repeat counts
`np.argmax/argmin(axis, keepdims)`	keepdims parameter
`np.convolve`	Complete rewrite (was throwing NRE)

Critical Bug Fixes

Behavioral Fixes

Bug	Before	After
`np.negative()`	Only negated positive values (`if val > 0`)	Negates ALL values (`val = -val`)
`np.positive()`	Applied `abs()`	Identity operation (returns input unchanged)
`np.unique()`	Returned unsorted	Sorts output, NaN at end
`np.dot(1D, 2D)`	Threw `NotSupportedException`	Treats 1D as row vector
`np.dot()` non-contiguous	Failed on strided arrays	Works with all memory layouts
`np.matmul()` broadcast	Crashed with >2D arrays	Full broadcasting support
`np.linspace()`	Returned `float32` for float inputs	Defaults to `float64`
`np.arange()`	Threw on `start >= stop`	Returns empty array
`np.searchsorted()`	No scalar support	Added scalar overloads returning `int`
`np.shuffle()`	Non-standard `passes` parameter	NumPy legacy API (axis-0 only)
`np.moveaxis()`	Broken	Verified working
`np.argsort()`	NaN handling incorrect	NaN-aware sorting
`np.intp`	Mapped to `int` (always 32-bit)	Uses `nint` (native-sized integer)
`np.uintp`	Not defined	Added as `nuint` (native unsigned)
`np.LogicalNot()`	Changed dtype	Preserves Boolean type
Float-to-int conversion	Used rounding	Uses truncation toward zero
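
Several of the "After" behaviors in this table are simple to state as code. The pure-Python sketch below shows the intended semantics for `negative`, `positive`, `unique`, and float-to-int conversion. The helper names are illustrative, and emitting a single trailing NaN in `unique` is an assumption based on NumPy's default behavior rather than something these notes state.

```python
import math


def negative(values):
    """Negate every element, regardless of sign."""
    return [-v for v in values]


def positive(values):
    """Identity: return the values unchanged (no abs())."""
    return list(values)


def unique(values):
    """Sorted distinct values, with NaN (if present) placed at the end."""
    distinct = sorted({v for v in values if not math.isnan(v)})
    has_nan = any(math.isnan(v) for v in values)
    return distinct + ([math.nan] if has_nan else [])


def float_to_int(x):
    """Convert by truncating toward zero, not by rounding."""
    return math.trunc(x)
```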

Return Type Fixes

Function	Before	After
`np.argmax()` / `np.argmin()`	Returned `int`	Returns `long` (large array support)
`np.abs()`	Converted to Double	Preserves input dtype

Empty Array Handling

Function	Before	After
`np.mean([])`	Threw or returned 0	Returns `NaN`
`np.mean(zeros((0,3)), axis=0)`	Incorrect	`[NaN, NaN, NaN]`
`np.mean(zeros((0,3)), axis=1)`	Incorrect	Empty array `[]`
`np.std/var` single element	Returned 0	Returns `NaN` when `ddof >= size`
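
The empty-array and `ddof` rules follow directly from the definitions: the mean divides by the element count and the variance divides by `n - ddof`, so a zero or negative divisor yields NaN. A minimal two-pass sketch in pure Python (illustrative names, not NumSharp code):

```python
import math


def mean(values):
    """Arithmetic mean; NaN for an empty input."""
    return sum(values) / len(values) if values else math.nan


def var(values, ddof=0):
    """Two-pass variance with divisor n - ddof; NaN when ddof >= n."""
    n = len(values)
    if n - ddof <= 0:
        return math.nan
    m = sum(values) / n                       # pass 1: the mean
    return sum((v - m) ** 2 for v in values) / (n - ddof)  # pass 2: squared deviations


def std(values, ddof=0):
    """Standard deviation as the square root of var; NaN propagates."""
    v = var(values, ddof)
    return v if math.isnan(v) else math.sqrt(v)
```

A single element with the default `ddof=0` still has variance 0.0; it is only with `ddof=1` (or larger) that the divisor reaches zero and the result becomes NaN.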

keepdims Fixes

All reduction functions now properly preserve dimensions when keepdims=True:

np.sum, np.prod, np.mean, np.std, np.var
np.min, np.max, np.argmin, np.argmax
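
The effect of keepdims on the result shape can be written down as a small shape function. This Python sketch follows NumPy's rule (the reduced axis is either dropped or kept with length 1); the function name is hypothetical.

```python
def reduced_shape(shape, axis=None, keepdims=False):
    """Result shape of reducing `shape` over one axis (or all axes if None)."""
    if axis is None:
        # Full reduction: scalar shape, or all-ones shape with keepdims.
        return (1,) * len(shape) if keepdims else ()
    axis = axis % len(shape)                  # normalize negative axes
    if keepdims:
        return shape[:axis] + (1,) + shape[axis + 1:]
    return shape[:axis] + shape[axis + 1:]
```

The same rule explains the empty-array cases listed earlier: reducing shape (0, 3) over axis 0 leaves shape (3,), while reducing it over axis 1 leaves the empty shape (0,).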

Rewritten Functions (IL kernel migration)

Function	Fix
`np.all()`	SIMD, all 12 dtypes (was boolean-only)
`np.any()`	SIMD with early-exit; axis parameter fixed (was always throwing)
`np.sum()`	Axis reduction for broadcast arrays
`np.cumsum()`	Axis support with SIMD; 4K lines of Regen code removed
`np.cumprod()`	Axis support with SIMD
`np.nonzero()`	Unified IL approach
`np.clip()`	IL kernel rewrite

Math Functions (IL migration)

All migrated from Regen templates to IL kernels with SIMD:

Trig: sin, cos, tan, sinh, cosh, tanh, arcsin, arccos, arctan, arctan2
Exp/Log: exp, exp2, expm1, log, log2, log10, log1p
Other: sqrt, abs, sign, floor, ceil, round

Operator Rewrites

Comparison Operators (==, !=, <, >, <=, >=)

Before: Manual type switch per dtype
After: Uses TensorEngine with IL kernels
Proper null handling (returns false scalar)
Empty array handling (returns empty bool array)
Added reverse operators (object op NDArray)
Full broadcasting support

Bitwise Operators (&, |, ^)

Before: Returned null
After: Full implementation via IL kernels
Added NDArray<T> typed operators
Scalar overloads for all integer types

Implicit Scalar Conversion

Before: (int)ndarray_float64 would fail
After: Uses Converts.ChangeType for cross-dtype conversion

Boolean Indexing Rewrite

Complete rewrite with NumPy-aligned behavior:

Two Cases Supported

arr[mask] where mask.shape == arr.shape → element-wise selection
arr[mask] where mask is 1D and mask.shape[0] == arr.shape[0] → axis-0 selection
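
The two cases produce differently shaped results, which a sketch over nested Python lists makes concrete. The function names are illustrative only; in NumPy both cases are written as `arr[mask]`.

```python
def mask_elementwise(rows, mask):
    """Same-shape mask: return the selected elements as a flat 1D list."""
    return [v for row, mrow in zip(rows, mask) for v, m in zip(row, mrow) if m]


def mask_axis0(rows, mask):
    """1D mask over axis 0: keep whole rows where the mask is true."""
    return [row for row, m in zip(rows, mask) if m]
```

For `arr = [[1, 2, 3], [4, 5, 6]]`, a same-shape mask selecting three elements yields a 1D result of length 3, while the 1D mask `[False, True]` yields a 2D result containing only the second row.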

SIMD Fast Path

New BooleanMaskFastPath for contiguous arrays
CountTrue(bool*, int) - SIMD count of true values
CopyMasked<T>(src, mask, dest, size) - SIMD masked copy
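
Splitting the work into a count and a copy lets the destination be allocated once at its exact size. The scalar Python sketch below shows that two-pass structure. The lower-case names are hypothetical stand-ins for CountTrue and CopyMasked, whose SIMD internals are not shown in these notes.

```python
def count_true(mask):
    """Pass 1: how many mask entries are true."""
    return sum(1 for m in mask if m)


def copy_masked(src, mask, dest):
    """Pass 2: copy selected elements into dest; returns the number written."""
    written = 0
    for value, m in zip(src, mask):
        if m:
            dest[written] = value
            written += 1
    return written


def boolean_mask(src, mask):
    """Allocate the exact output size, then fill it."""
    dest = [None] * count_true(mask)
    copy_masked(src, mask, dest)
    return dest
```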

Slicing Improvements

Broadcast Array Handling

Before: Slicing broadcast arrays would materialize data (losing stride=0)
After: Preserves stride=0 information (NumPy behavior)
Critical for cumsum and axis reductions on broadcast arrays
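
Why slicing can preserve stride=0 is visible in the view arithmetic: a slice only rescales the stride of the sliced axis and shifts the offset, and zero stays zero under both. The Python sketch below uses element (not byte) strides and assumes in-bounds, non-negative `start`/`stop` with a positive `step`; the function names are hypothetical.

```python
def element_offset(index, strides, offset=0):
    """Position in the flat buffer of the element at `index`."""
    return offset + sum(i * s for i, s in zip(index, strides))


def slice_axis(shape, strides, offset, axis, start, stop, step=1):
    """Return (shape, strides, offset) of a view sliced along one axis."""
    length = len(range(start, stop, step))
    new_shape = shape[:axis] + (length,) + shape[axis + 1:]
    # A broadcast axis has stride 0, and 0 * step is still 0.
    new_strides = strides[:axis] + (strides[axis] * step,) + strides[axis + 1:]
    return new_shape, new_strides, offset + start * strides[axis]
```

Take a 3-element buffer broadcast to shape (4, 3) with strides (0, 1). Slicing rows 1:3 gives shape (2, 3) with strides still (0, 1) and offset 0, so every row continues to alias the same three elements without any copy.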

Empty Slice Handling

a[100:200] on a 10-element array now returns a proper empty array
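
This matches Python's own slice semantics, which NumPy follows: out-of-range bounds are clamped to the sequence length instead of raising. A short illustration using a plain Python list:

```python
data = list(range(10))

# Out-of-range slice bounds are clamped, so the result is empty, not an error.
empty = data[100:200]

# slice.indices() exposes the clamped (start, stop, step) for a given length.
start, stop, step = slice(100, 200).indices(len(data))
clamped_length = len(range(start, stop, step))
```

Here both bounds clamp to 10, so the slice covers zero elements.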

Contiguous Optimization

Contiguous slices get fresh shape with offset=0
IsSliced=false for contiguous slices

Performance Improvements

Operation	Improvement	Details
MatMul (2D)	35-100x	Cache-blocked SIMD, 20+ GFLOPS
Axis Reductions	Major	AVX2 gather + parallel outer loop
All/Any	Major	SIMD with early-exit
CumSum/CumProd	Major	Element-wise SIMD
Boolean Masking	Major	SIMD CountTrue + CopyMasked
Integer Abs/Sign	Minor	Bitwise (branchless)
Vector512	New	Runtime detection and utilization
Loop Unrolling	4x	All SIMD kernels
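
Cache blocking, the technique named for MatMul above, reorders the loops so that small tiles of the operands are reused while they are still in cache. The pure-Python sketch below shows only that loop structure, with a naive version for comparison. The block size of 2 is arbitrary and no SIMD is involved, so it says nothing about the actual speedup.

```python
def matmul_naive(a, b):
    """Reference triple loop: c[i][j] = sum over p of a[i][p] * b[p][j]."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_blocked(a, b, block=2):
    """Same product, computed tile by tile over (i, p, j) blocks."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for p0 in range(0, k, block):
            for j0 in range(0, m, block):
                # Work on one tile; min() handles partial tiles at the edges.
                for i in range(i0, min(i0 + block, n)):
                    row_c = c[i]
                    for p in range(p0, min(p0 + block, k)):
                        a_ip = a[i][p]
                        row_b = b[p]
                        for j in range(j0, min(j0 + block, m)):
                            row_c[j] += a_ip * row_b[j]
    return c
```

Both functions return the same matrix; only the order in which partial products are accumulated differs.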

Code Reduction

Massive File Deletions

Component	Before	After	Reduction
Binary ops (Add/Sub/Mul/Div/Mod)	60 files, ~500K lines	2 IL files	99%
`Default.MatMul.2D2D.cs`	~20K lines	325 lines	98.4%
`Default.Dot.NDMD.cs`	~16K lines	422 lines	97.4%
Comparison ops (Equals)	13 files	1 IL file	92%
Std/Var reductions	~20K lines	~500 lines	97%

Deleted Files (76)

60 binary op files (Default.Add.{Type}.cs, etc.)
13 comparison files (Default.Equals.{Type}.cs, etc.)
3 template files

Infrastructure Changes

Memory Allocation

Marshal.AllocHGlobal → NativeMemory.Alloc
Marshal.FreeHGlobal → NativeMemory.Free
AllocationType.AllocHGlobal → AllocationType.Native
StackedMemoryPool migrated to NativeMemory

DefaultEngine

ILKernelGenerator is a static class (internal to DefaultEngine)
Single-threaded execution (no Parallel.For)

Math Functions

All migrated from Regen templates to ExecuteUnaryOp:

Sin, Cos, Tan, ASin, ACos, ATan, ATan2
Exp, Exp2, Expm1, Log, Log2, Log10, Log1p
Sqrt, Cbrt, Abs, Sign, Floor, Ceil, Truncate
Removed DecimalMath dependency for most operations

TensorEngine Extensions

New abstract methods (28 total):

Comparison: Compare, NotEqual, Less, LessEqual, Greater, GreaterEqual
Bitwise: BitwiseAnd, BitwiseOr, BitwiseXor, LeftShift, RightShift
Math: Power(NDArray, NDArray), FloorDivide, Truncate, Reciprocal, Square, Cbrt, Invert, Deg2Rad, Rad2Deg, IsInf
Reduction: ReduceCumMul, Any, NanSum, NanProd, NanMin, NanMax
Indexing: BooleanMask

IKernelProvider Methods

CountTrue(bool*, int) - SIMD true count
CopyMasked<T> - SIMD masked copy
Variance<T>, StandardDeviation<T> - SIMD two-pass
NanSum/Prod/Min/Max for float/double
FindNonZeroStrided<T> - Strided nonzero detection

API Alignment

API	NumPy-Aligned Behavior
`np.random.random()`	Alias for `random_sample()`
`np.random.standard_normal()`	Correct spelling (matches NumPy)
`np.random.*` params	`size`, `a`, `b`, `p`, `d0` (NumPy names)
`np.random.randn/rand/normal`	Accept `Shape` parameter
`np.minimum/maximum`	`dtype` parameter (not `outType`)
`np.modf()`	Validates floating-point input
