Releases: SciSharp/NumSharp
NumSharp 0.41.0-prerelease
This prerelease introduces the IL Kernel Generator -
A complete architectural overhaul that replaces ~600K lines of Regen-generated template code with ~19K lines of runtime IL generation.
This delivers massive performance improvements, comprehensive NumPy 2.x alignment, and significantly cleaner, more maintainable code.
Installation
```
dotnet add package NumSharp --version 0.41.0-prerelease
```
Or via Package Manager:
```
Install-Package NumSharp -Version 0.41.0-prerelease
```
TL;DR
- IL Kernel Generator: Runtime IL emission replaces 600K lines of Regen templates with 19K lines
- SIMD everywhere: Vector128/256/512 with runtime detection across all operations
- 35 new functions: nansum/prod/min/max/mean/var/std, cbrt, floor_divide, left/right_shift, deg2rad, rad2deg, cumprod, count_nonzero, isnan, isfinite, isinf, isclose, invert, reciprocal, square, trunc, plus comparison and logical modules
- Operators fixed: `==`, `!=`, `<`, `>`, `<=`, `>=`, `&`, `|`, `^`
- np.comparison module: np.equal(), np.not_equal(), np.less(), np.greater(), np.less_equal(), np.greater_equal()
- np.logical module: np.logical_and(), np.logical_or(), np.logical_not(), np.logical_xor()
- NDArray<T> operators: typed `&`, `|`, `^` for generic arrays (resolves `NDArray<bool>` ambiguity)
- Math functions rewritten: sin, cos, tan, exp, log, sqrt, abs, sign, floor, ceil, etc.
- 60+ bug fixes: np.negative, np.positive, np.unique, np.dot, np.matmul, np.abs, np.argmax/min, np.mean, np.std/var, np.cumsum, np.nonzero, np.all/any, np.clip, and more
- MatMul 35-100x faster: Cache-blocked SIMD achieving 20+ GFLOPS
- Boolean indexing rewrite: SIMD fast path with CountTrue/CopyMasked
- Axis reductions rewrite: AVX2 gather, NaN-aware, proper keepdims and empty array handling
- Single-threaded execution: deterministic and non-blocking; removed use of `Parallel.*` (SIMD compensates for the lost parallelism)
- Architecture cleanup: broadcasting in the Shape struct, TensorEngine routing, static ILKernelGenerator
- np.random aligned (#582): Parameter names match NumPy, Shape overloads added
- DecimalMath internalized (#588): Removed embedded third-party code
- NEP50 compliant: NumPy 2.x type promotion rules
- Benchmark infrastructure: SIMD vs scalar comparison suite
- DefaultEngine dispatch layer: BinaryOp, BitwiseOp, CompareOp, ReductionOp, UnaryOp
- 4,200+ unit tests, both our own and tests migrated from Python/NumPy to C#.
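The NEP50 promotion rules mentioned above can be illustrated with NumPy itself (Python shown here for brevity; NumSharp's C# `np.*` API mirrors these names):

```python
import numpy as np

# Array-array promotion: the wider dtype wins.
a = np.array([1, 2, 3], dtype=np.int8)
b = np.array([1, 2, 3], dtype=np.int16)
assert (a + b).dtype == np.int16

f32 = np.array([1.0], dtype=np.float32)
f64 = np.array([1.0], dtype=np.float64)
assert (f32 + f64).dtype == np.float64

# Under NEP 50 (NumPy 2.x), Python scalars are "weak": they adopt
# the array's dtype rather than forcing a value-based promotion.
x = np.array([1, 2], dtype=np.float32)
assert (x + 3.0).dtype == np.float32
```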
Contents
| Section | Highlights |
|---|---|
| Summary | 106 commits, -533K lines, 3,907 tests |
| IL Kernel Generator | 27 files, SIMD V128/256/512 |
| Architecture | Static ILKernelGenerator, TensorEngine routing |
| New NumPy Functions (35) | nansum, isnan, cumprod, etc. |
| Critical Bug Fixes | negative, unique, dot, linspace, intp |
| Operator Rewrites | `==`, `!=`, `<`, `>`, `&`, `\|` now work |
| Boolean Indexing Rewrite | SIMD fast path, 76 battle tests |
| Slicing Improvements | Broadcast stride=0 preserved |
| Performance Improvements | MatMul 35-100x, 20+ GFLOPS |
| Code Reduction | 99% binary, 98% MatMul, 97% Dot |
| Infrastructure Changes | NativeMemory, static kernels |
| API Alignment | random() params aligned with NumPy |
| New Test Files (68) | 34 kernel, 8 NumPy, 4 linalg, 76 boolean |
| Known Issues | 52 OpenBugs excluded |
| Installation | dotnet add package NumSharp |
Summary
| Metric | Value |
|---|---|
| Commits | 106 |
| Files Changed | 558 |
| Lines Added | +72,635 |
| Lines Deleted | -605,976 |
| Net Change | -533K lines |
| Test Results | 3,907 passed, 52 OpenBugs, 11 skipped |
Detailed Breakdown
Read More
IL Kernel Generator
Runtime IL generation via System.Reflection.Emit.DynamicMethod replaces static Regen templates.
Kernel Files (27 new files)
- ILKernelGenerator.cs - Core infrastructure, SIMD detection (Vector128/256/512)
- ILKernelGenerator.Binary.cs - Add, Sub, Mul, Div, BitwiseAnd/Or/Xor
- ILKernelGenerator.MixedType.cs - Mixed-type ops with type promotion
- ILKernelGenerator.Unary.cs - Negate, Abs, Sqrt, Sin, Cos, Exp, Log, Sign
- ILKernelGenerator.Comparison.cs - ==, !=, <, >, <=, >= returning bool arrays
- ILKernelGenerator.Reduction.cs - Sum, Prod, Min, Max, Mean, ArgMax, ArgMin, All, Any
- ILKernelGenerator.Reduction.Axis.Simd.cs - AVX2 gather for axis reductions
- ILKernelGenerator.Scan.cs - CumSum, CumProd with SIMD
- ILKernelGenerator.Shift.cs - LeftShift, RightShift
- ILKernelGenerator.MatMul.cs - Cache-blocked SIMD matrix multiply
- ILKernelGenerator.Clip.cs, .Modf.cs, .Masking.cs - Specialized ops
Execution Paths
- SimdFull - Contiguous + SIMD-capable dtype → Vector loop + scalar tail
- ScalarFull - Contiguous + non-SIMD dtype (Decimal) → Scalar loop
- General - Strided/broadcast → Coordinate-based iteration
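The three-path dispatch above can be sketched as follows (a hypothetical illustration in Python; the function and set names here are not NumSharp's actual identifiers):

```python
# Hypothetical sketch of the three execution-path dispatch.
SIMD_DTYPES = {"float32", "float64", "int32", "int64"}  # illustrative subset

def select_path(contiguous: bool, dtype: str) -> str:
    if contiguous and dtype in SIMD_DTYPES:
        return "SimdFull"    # vector loop + scalar tail
    if contiguous:
        return "ScalarFull"  # e.g. Decimal: plain scalar loop
    return "General"         # strided/broadcast: coordinate-based iteration

assert select_path(True, "float64") == "SimdFull"
assert select_path(True, "decimal") == "ScalarFull"
assert select_path(False, "float64") == "General"
```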
Infrastructure
- KernelKey.cs, KernelOp.cs, KernelSignatures.cs - Kernel dispatch
- SimdMatMul.cs - SIMD matrix multiplication helpers
- TypeRules.cs - NEP50 type promotion rules
Architecture
Clean separation of concerns:
| Component | Design |
|---|---|
| ILKernelGenerator | Static class (27 partial files), internal to DefaultEngine |
| TensorEngine | All np.* ops route through abstract methods |
| Shape.Broadcasting | Pure shape math in Shape struct (456 lines) |
| ArgMin/ArgMax | Unified IL kernel with NaN-aware + Boolean semantics |
| DecimalMath | Internal utility (~403 lines) for Sqrt, Pow, ATan2, Exp, Log |
Single-Threaded Execution
All computation is single-threaded with no Parallel.For usage. This provides:
- Deterministic behavior - Same inputs always produce same outputs in same order
- Non-blocking execution - No thread synchronization overhead
- Simplified debugging - Stack traces are straightforward
- SIMD compensation - Vector128/256/512 intrinsics provide parallelism at the CPU level
Broadcasting External to Engine
Broadcasting logic (Shape.Broadcasting.cs) is pure shape math with no engine dependencies:
- Shape.AreBroadcastable() - Check if shapes can broadcast
- Shape.Broadcast() - Compute broadcast result shape and strides
- Shape.ResolveReturnShape() - Determine output shape for operations
- DefaultEngine delegates all broadcasting to Shape.* methods
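The shape math follows NumPy's broadcasting rule (trailing dimensions must match or be 1). A minimal Python sketch of the same logic, checked against NumPy's reference implementation:

```python
from itertools import zip_longest
import numpy as np

def are_broadcastable(s1, s2):
    # Compare trailing dimensions: each pair must be equal or contain a 1.
    return all(a == b or a == 1 or b == 1
               for a, b in zip(reversed(s1), reversed(s2)))

def broadcast_shape(s1, s2):
    # Pad the shorter shape with 1s on the left, take max per dimension.
    rev = [max(a, b) for a, b in
           zip_longest(reversed(s1), reversed(s2), fillvalue=1)]
    return tuple(reversed(rev))

assert are_broadcastable((3, 1), (1, 4))
assert broadcast_shape((3, 1), (1, 4)) == (3, 4)
# Agrees with NumPy's own computation:
assert broadcast_shape((8, 1, 5), (2, 5)) == np.broadcast_shapes((8, 1, 5), (2, 5))
```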
DecimalMath (#588)
Replaced embedded third-party DecimalEx.cs (~1061 lines) with minimal internal DecimalMath.cs (~403 lines) containing only the functions NumSharp actually uses: Sqrt, Pow, ATan2, Exp, Log, Log10, ATan.
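For intuition, the kind of arbitrary-precision routine such a utility implements (e.g. Sqrt) is typically Newton's method with guard digits. A sketch using Python's decimal module, purely illustrative and not the NumSharp code:

```python
from decimal import Decimal, getcontext

def decimal_sqrt(x: Decimal) -> Decimal:
    # Newton's method: iterate r = (r + x/r) / 2 until r stops changing.
    if x < 0:
        raise ValueError("sqrt of negative Decimal")
    if x == 0:
        return Decimal(0)
    getcontext().prec += 2          # guard digits during iteration
    r = x / 2 if x > 1 else Decimal(1)
    for _ in range(200):
        last = r
        r = (r + x / r) / 2
        if r == last:
            break
    getcontext().prec -= 2
    return +r                        # round back to working precision

assert abs(decimal_sqrt(Decimal(16)) - 4) < Decimal("1e-25")
assert abs(decimal_sqrt(Decimal(2)) ** 2 - 2) < Decimal("1e-25")
```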
TensorEngine Abstract Methods
Compare, NotEqual, Less, LessEqual, Greater, GreaterEqual, BitwiseAnd, BitwiseOr, BitwiseXor, LeftShift, RightShift, Power(NDArray, NDArray), FloorDivide, Truncate, Reciprocal, Square, Cbrt, Invert, Deg2Rad, Rad2Deg, IsInf, ReduceCumMul, Any, NanSum, NanProd, NanMin, NanMax, BooleanMask
DefaultEngine Dispatch Files (IL kernel integration)
| File | Functions |
|---|---|
| DefaultEngine.BinaryOp.cs | np.add, np.subtract, np.multiply, np.divide, np.mod, np.power |
| DefaultEngine.BitwiseOp.cs | np.bitwise_and, np.bitwise_or, np.bitwise_xor, `&`, `\|`, `^` |
| DefaultEngine.CompareOp.cs | np.equal, np.not_equal, np.less, np.greater, np.less_equal, np.greater_equal |
| DefaultEngine.ReductionOp.cs | np.sum, np.prod, np.min, np.max, np.mean, np.std, np.var, np.argmax, np.argmin |
| DefaultEngine.UnaryOp.cs | np.abs, np.negative, np.sqrt, np.sin, np.cos, np.exp, np.log, np.sign, etc. |
Implementation Files
Default.Any.cs, Default.BooleanMask.cs, Default.Reduction.Nan.cs, Shape.Broadcasting.cs
New NumPy Functions (35)
NaN-Aware Reductions (7)
| Function | Description |
|---|---|
| np.nansum | Sum ignoring NaN |
| np.nanprod | Product ignoring NaN |
| np.nanmin | Minimum ignoring NaN |
| np.nanmax | Maximum ignoring NaN |
| np.nanmean | Mean ignoring NaN |
| np.nanvar | Variance ignoring NaN |
| np.nanstd | Standard deviation ignoring NaN |
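These match NumPy's NaN-aware semantics; a quick Python illustration of the reference behavior (NumSharp's equivalent np.* calls mirror this):

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0])

assert np.isnan(np.sum(a))       # plain reductions propagate NaN
assert np.nansum(a) == 4.0       # nansum treats NaN as 0
assert np.nanprod(a) == 3.0      # nanprod treats NaN as 1
assert np.nanmax(a) == 3.0
assert np.nanmean(a) == 2.0      # mean over the 2 non-NaN values only
```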
Math Operations (8)
| Function | Description |
|---|---|
| np.cbrt | Cube root |
| np.floor_divide | Integer division |
| np.reciprocal | Element-wise 1/x |
| np.trunc | Truncate to integer |
| np.invert | Bitwise NOT |
| np.square | Element-wise square |
| np.cumprod | Cumulative product |
| np.count_nonzero | Count non-zero elements |
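The rounding conventions here are worth noting: floor_divide rounds toward negative infinity while trunc rounds toward zero. Illustrated with the NumPy reference behavior these functions follow:

```python
import numpy as np

# floor_divide rounds toward -inf; trunc rounds toward 0.
assert np.array_equal(np.floor_divide([7, -7], 2), [3, -4])
assert np.array_equal(np.trunc([2.7, -2.7]), [2.0, -2.0])

assert np.array_equal(np.cumprod([1, 2, 3, 4]), [1, 2, 6, 24])
assert np.count_nonzero([0, 1, 0, 5]) == 2
assert np.isclose(np.cbrt(27.0), 3.0)
assert np.array_equal(np.reciprocal([0.5, 4.0]), [2.0, 0.25])
```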
Bitwise & Trigonometric (4)
| Function | Description |
|---|---|
| np.left_shift | Bitwise left shift |
| np.right_shift | Bitwise right shift |
| np.deg2rad | Degrees to radians |
| np.rad2deg | Radians to degrees |
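A quick check of the reference semantics (shifts are element-wise on integers; angle conversions scale by pi/180):

```python
import numpy as np

assert np.array_equal(np.left_shift([1, 2], 3), [8, 16])    # x << 3 == x * 8
assert np.array_equal(np.right_shift([8, 16], 2), [2, 4])   # x >> 2 == x // 4
assert np.isclose(np.deg2rad(180.0), np.pi)
assert np.isclose(np.rad2deg(np.pi / 2), 90.0)
```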
Logic & Validation (4) - Previously returned null
| Function | Description |
|---|---|
| np.isnan | Test element-wise for NaN |
| np.isfinite | Test element-wise for finiteness |
| np.isinf | Test element-wise for infinity |
| np.isclose | Element-wise comparison within tolerance |
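The reference semantics these align with: the three predicates partition floats into NaN, infinite, and finite, and isclose compares with the tolerance formula |a - b| <= atol + rtol * |b| (NumPy defaults rtol=1e-05, atol=1e-08):

```python
import numpy as np

x = np.array([1.0, np.inf, np.nan])
assert np.array_equal(np.isnan(x),    [False, False, True])
assert np.array_equal(np.isinf(x),    [False, True, False])
assert np.array_equal(np.isfinite(x), [True, False, False])

# isclose: |a - b| <= atol + rtol * |b|
assert np.isclose(1.0, 1.0 + 1e-9)
assert not np.isclose(1.0, 1.1)
```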
Operators (2) - Previously returned null
| Operator | Description |
|---|---|
| operator & | Bitwise/logical AND with broadcasting |
| operator \| | Bitwise/logical OR with broadcasting |
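Broadcasting applies to these operators just as to arithmetic ones; the NumPy reference behavior (which the C# `&`/`|` operators mirror on boolean arrays):

```python
import numpy as np

a = np.array([[True], [False]])   # shape (2, 1)
b = np.array([True, False])       # shape (2,) -> broadcasts to (2, 2)

assert np.array_equal(a & b, [[True, False], [False, False]])
assert np.array_equal(a | b, [[True, True],  [True, False]])
```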
Comparison Functions (6) - New named AP...
v0.4.0-alpha1
NumSharp v0.4.0-alpha1
See #538 for information.
NuGet
No NuGet release for this preview version.
What's Changed
- Enabled NDArray boolean comparisons for LessThan, GreaterThan, and … by @Rikki-Tavi in #395
- Added data types in np.frombuffer in #425
- F# in README by @dsyme in #432
- Added support for user defined decimal precision for np.around() and TensorEngine.Round() by @shashi4u in #453
- NumSharp.Bitmap support for odd sized bitmaps with odd sized bytes per pixel by @AmbachtIT in #460
- Fixing the consistency of seed in the random choice. by @bojake in #489
- (Logics):add high performance logical AND function with axis an… by @zhuoshui-AI in #525
- Upgrade target frameworks to net8.0;net10.0 by @Nucs in #532
- Add GitHub Actions CI/CD pipeline by @Nucs in #534
- Fix: skip Bitmap tests on non-Windows CI by @Nucs in #535
- docs: relocate website to docs/website/ by @Nucs in #557
- docs: move docfx_project to docs/website-src by @Nucs in #558
- feat(docs): upgrade to DocFX v2 modern template by @Nucs in #562
New Contributors
Many contributors' merges were piggybacked onto this release and were probably not entirely intentional.
- @Rikki-Tavi made their first contribution in #395
- @dsyme made their first contribution in #432
- @shashi4u made their first contribution in #453
- @AmbachtIT made their first contribution in #460
- @bojake made their first contribution in #489
- @zhuoshui-AI made their first contribution in #525
Full Changelog: 0.20.5...v0.4.0-alpha1
v0.20.5
- NDArray.Indexing: Rewrite of the getter mechanism, NDArray getter now supports combining 'NDArray, Slice, string, int, bool' in the same slice.
- NDArray.Indexing: Added support for indexing with an unmanaged array of indices: ndarray[int* pointer, int length], nd.GetData(int*, int), etc.
- NDArray.Broadcasting: fixed multiple issues.
- NDArray.Slicing: Added support for slicing a broadcasted NDArray.
- Added NPTypeCode.Float as an alias to NPTypeCode.Single
- Extending NPY and fixing NPZ (Thanks Matthew Moloney)
- Added NDArray.AsOrMakeGeneric()
- Added np.nonzero, np.maximum, np.minimum, np.all, np.any
- Arrays.cs: performance-optimized Arrays.Slice
- NDArray.FromMultiDimArray: Fixed #367
- np.clip: Added @out argument
- Added np.array(IEnumerable) and np.array(IEnumerable, int size) which is faster.
- np.broadcast_to: added additional overloads.
v0.20.4
Changes
- Added np.transpose, np.swapaxes, ndarray.T, np.moveaxis, np.rollaxis, np.size, np.copyto.
- Added np.ceil, np.arccos, np.floor, np.modf, np.square, np.round, np.sign, np.arcsin, np.arctan.
- Added np.random.*: beta, gamma, bernoulli, binomial, lognormal, normal, poisson, chisquare, geometric.
- Added support for np.newaxis and ... (ellipsis) in a slice.
- Performance optimization for np.array, np.linspace, the Randomizer class, and all np.random.* methods.
Bug Fixes
- ndarray.view was copying when it shouldn't.
- Fixed a couple of ambiguous methods.
Obsoletion
nd.Unsafe.Shape is now obsolete in favor of nd.Shape.
Special thanks to @henon and @deepakkumar1984 for PRing a great portion of this release.
v0.20.3
v0.10-slice
release signed assembly v0.10.6.
v0.7 works with TensorFlow.NET
v0.7-tensorflow Merge branch 'master' of https://github.com/Oceania2018/NumSharp
v0.6 Supports LAPACK
Merge pull request #162 from dotChris90/master Extend doc and generated new API docs
v0.5-dtype
release v0.5
v0.4
released v0.4