Add CMake build system overlay with cross-platform support and source fixes by SoShiny · Pull Request #71 · marekandreas/elpa

SoShiny · 2026-04-22T18:02:23Z

Motivation

This work originates from Synopsys QuantumATK's need to build ELPA on Windows,
aligning it with the existing Linux version. The CMake build system is added as a
supplemental overlay alongside the existing autotools setup; with some relatively
minor extra work it could serve as a replacement.

Advantages over autotools:

- Native Windows support without MSYS2/Cygwin.
- Faster configure/generate cycles on large projects.
- IDE integration (Visual Studio, CLion, VS Code) out of the box.
- Easier cross-compilation and toolchain-file support.

While developing the CMake overlay, several in-source issues were discovered and
fixed — ranging from hard bugs (crashes, wrong results) to build/runtime warning
cleanups. ARM (AArch64) and macOS support was explored as well, driven by the
recurring demand to run on Apple Silicon.

What's included

CMake build system (new, additive)

- Top-level CMakeLists.txt plus modular cmake/ modules covering compiler
  options, feature checks, kernel selection, Fortran preprocessing, generated
  files, CUDA dependencies, MPI, OpenMP, math libraries, selective symbol
  export, install, testing, and packaging.
- Python helpers for Fortran preprocessing, interface extraction, export DEF
  generation, test generation, and constant processing.
- Ready-made build configuration scripts for Linux (GCC, Clang/Flang, AOCC,
  Intel ICX/IFX; with MKL, OpenBLAS, BLIS; OpenMPI and Intel MPI; x86-64 and
  AArch64), Windows (Clang-CL + ifort/ifx), and macOS ARM64 (GCC +
  Accelerate/OpenBLAS).
- Prerequisite checker, build driver, test runner, and validation harness,
  primarily for documentation and reproducibility rather than CI-grade automation.

Bug fixes (correctness)

- Fix CUDA streams crash: my_stream must be firstprivate in OpenMP regions.
- Fix test failures in Release builds caused by assert side effects.
- Fix elpa_index sscanf stack corruption and missing returns.
- Fix uninitialized variables in cusolver, GPU vendor-agnostic, and solve_tridi
  templates.
- Fix GPU skew-symmetric bandred column offset.
- Fix missing return when Cholesky decomposition fails.
- Fix undefined _XOR_EPI in AVX-512 Xeon Phi complex kernel.
- Fix NEON AArch64 single-precision _SIMD_NFMA argument order.
- Fix single-precision failures from broken Accelerate vecLib auxiliaries.
- Fix gpuMemcpy* common-symbol ODR violations in test layer.

Warning and portability fixes

- Replace deprecated omp master with omp masked.
- Fix printf format specifiers, string literal const-correctness, macro
  redefinitions, dead variables, unused parameters, and VLA const qualifiers.
- Add inline keyword to __forceinline macro for standard compliance.
- Work around flang-new type(c_ptr) bind(C) codegen bug.
- Add Windows portability: posix_memalign shim, Complex_I macro, portable
  complex type macros, _Generic selections, POSIX-only API guards, DLL symbol
  export initialization, and GPU/test/CUDA source adjustments.
- Cap test_multiple_objs autotuning loop at 20 SCF steps.

What's missing

The following autotools features are not yet implemented in the CMake overlay:

- ROCm (HIP) and SYCL GPU back-ends; only CUDA has been tried.
- Doxygen documentation generation (make doc).
- make dist / tarball packaging.
- Full feature parity for all configure flags (e.g. some niche kernel
  selections or experimental options may be missing).

Scope

134 files changed, ~10 300 insertions, ~300 deletions.
Of the insertions, ~9 000 lines are the CMake overlay and build examples;
the remaining ~1 300 lines touch existing sources (fixes + portability guards).

Replace undefined _XOR_EPI macro call with _SIMD_XOR_EPI in the AVX-512 Xeon Phi path of the complex BLOCK kernel template. Only _SIMD_XOR_EPI is defined. The code compiles when the ifdef is inactive but would fail on Knights Landing processors.

Caller sees stale error code when the Cholesky decomposition fails because the function falls through without returning the failure status. Set a unique error code (141414) and return immediately.

last_stripe_width was computed inside if(useGPU) but used unconditionally on the CPU path. On Linux the stack happened to be zeroed, masking the bug. On Windows the uninitialized value caused wrong stripe decomposition and 2stage eigenvector corruption.

In tridiag_template.F90, my_stream was declared private in the OpenMP parallel directive, leaving each thread with an uninitialized stream handle. Changed to firstprivate so every thread copies the initialized value, preventing access violations in the CUDA driver.
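The private versus firstprivate distinction can be sketched in C with the same OpenMP clause. This is a minimal illustration, not the ELPA code: stream_handle and count_threads_with_valid_stream are hypothetical names, and a plain integer stands in for the CUDA stream. Without OpenMP enabled the pragma is ignored and the function runs on a single thread.

```c
/* Hypothetical stand-in for a GPU stream handle; not the real CUDA type. */
typedef long stream_handle;

/* Counts the threads that observe the initialized handle value.
   With firstprivate, each thread's private copy starts as a copy of the
   caller's value; with private(stream) it would start uninitialized. */
int count_threads_with_valid_stream(stream_handle stream, stream_handle expected)
{
    int valid = 0;
    #pragma omp parallel firstprivate(stream) reduction(+:valid)
    {
        if (stream == expected) {
            valid += 1;
        }
    }
    return valid;
}
```

Every thread in the region should report a valid handle, which is the property the Fortran fix relies on.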

elpa_setup() and elpa_setup_gpu() were called inside assert_elpa_ok() which expands to assert(). With NDEBUG the calls are silently removed, leaving the ELPA handle uninitialized and causing crashes. Move the calls outside the assert.
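The pattern behind this fix can be sketched as follows. fake_setup, init_fragile, and init_robust are hypothetical names used only for illustration; fake_setup stands in for a setup call whose side effect must not be lost.

```c
#include <assert.h>

/* Hypothetical stand-in for a setup routine with a required side effect. */
int setup_calls = 0;

int fake_setup(void)
{
    setup_calls++;
    return 0; /* 0 signals success in this sketch */
}

/* Fragile: with -DNDEBUG the whole expression, including the call,
   is removed by the preprocessor. */
void init_fragile(void)
{
    assert(fake_setup() == 0);
}

/* Robust: the call always runs; only the check disappears under NDEBUG. */
void init_robust(void)
{
    int status = fake_setup();
    assert(status == 0);
    (void)status; /* avoids an unused-variable warning under NDEBUG */
}
```

Compiling the fragile variant with and without -DNDEBUG and comparing setup_calls is a quick way to reproduce the Release-only behavior.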

sscanf "%lf" reads a double into a float* pointer, corrupting the stack. Changed to "%f". Also added missing return values in _enumerate switch functions and a portable _lfind wrapper for Windows.
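A minimal sketch of the conversion-specifier rule involved, assuming a hypothetical helper named parse_float: for the scanf family, "%f" expects a float* and "%lf" expects a double*, so pairing "%lf" with a float* writes past the end of the target object.

```c
#include <stdio.h>

/* Parses a float from text; returns 1 on success, 0 on failure.
   "%f" matches the float* argument. "%lf" would make sscanf store a
   double (typically 8 bytes) into a 4-byte float. */
int parse_float(const char *text, float *out)
{
    return sscanf(text, "%f", out) == 1;
}
```

Checking the sscanf return value, as done here, also catches input that does not parse at all.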

Initialize my_stream to 0 before the WITH_GPU_STREAMS conditional in solve_tridi_col, solve_tridi_single_problem, and merge_systems templates. Prevents run-time uninitialized-variable errors when GPU streams are disabled.

Initialize return variables (flag=-1, success=.false., version=0) before preprocessor-guarded assignments in GPU vendor-agnostic CCL and BLAS layer templates. Prevents compiler warnings and undefined behavior when compiled without a GPU backend.

Initialize version=0 and success=.false. in GPU vendor-agnostic and cusolver templates before preprocessor-guarded assignments. Suppresses ifort #6178 on non-GPU builds.

Initialize the version result to 0 before the WITH_AMD_ROCSOLVER guard so the variable is always defined on non-ROCm builds.

The umc_dev pointer arithmetic in the GPU skewsymmetric path of bandred used (n_cols+1+1) instead of (n_cols+1-1) for the Fortran-to-C 0-based index conversion. This read 2 columns past the intended position, causing wrong eigenvalues for small matrices and CUDA memory errors for large matrices. The CPU path and GPU symmetric path were correct.
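The index arithmetic can be illustrated with a small sketch. It assumes column-major storage with a leading dimension ld, which is not stated in the commit message, and the helper names are hypothetical; the actual ELPA expression is not reproduced here.

```c
#include <stddef.h>

/* Element offset of the first entry of a 1-based (Fortran) column in a
   column-major buffer with leading dimension ld (assumed layout). */
size_t column_offset(size_t col_1based, size_t ld)
{
    return (col_1based - 1) * ld;
}

/* The erroneous form adds 1 where the conversion should subtract 1,
   which lands two columns past the intended one. */
size_t column_offset_buggy(size_t col_1based, size_t ld)
{
    return (col_1based + 1) * ld;
}
```

For Fortran column n_cols+1 the correct offset is n_cols*ld, while the buggy form yields (n_cols+2)*ld, a difference of exactly two columns.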

The Fortran variant used an unbounded do while(autotune_step()) loop that exhausts the full ELPA_AUTOTUNE_FAST search space. With CUDA enabled the search space contains hundreds of parameter combinations, and each iteration also executes two call sleep(2) statements, so the test runs for 600+ seconds. The C and C++ variants already cap the loop at 20 SCF steps. Apply the same limit to the Fortran variant and remove the sleep() calls.
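A capped loop of this kind might look like the sketch below. The step callback, countdown_step, and run_capped_autotune are hypothetical and do not use the ELPA API; only the limit of 20 comes from the description above.

```c
#define MAX_SCF_STEPS 20

/* Hypothetical step callback: returns nonzero while candidates remain. */
typedef int (*step_fn)(void *state);

/* Example callback that consumes one candidate from a counter per call. */
int countdown_step(void *state)
{
    int *remaining = (int *)state;
    if (*remaining <= 0) {
        return 0;
    }
    (*remaining)--;
    return 1;
}

/* Runs the step callback until it reports completion or the cap is hit.
   Returns the number of steps that were executed. */
int run_capped_autotune(step_fn step, void *state)
{
    int steps = 0;
    while (steps < MAX_SCF_STEPS && step(state)) {
        steps++;
    }
    return steps;
}
```

Because the cap is tested first, the callback is not invoked again once 20 steps have run.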

Add conditional includes for POSIX-only headers, complex-type wrappers, and Windows-specific guards around the aligned allocation interfaces. All changes are guarded with #ifdef _WIN32 and leave Linux behavior unchanged.

Add platform-aware complex type macros for the SIMD kernel template so clang-cl (which supports C99 _Complex unlike MSVC cl.exe) can compile the complex BLOCK kernels on Windows.

MSVC headers redefine 'complex' as a macro via corecrt_math.h. Use double_complex/float_complex platform macros in _Generic selections to avoid conflicts.

Add alloca.h fallback for C and CUDA tests, use C99 _Complex types directly for ccache+clang-cl compatibility.

Add alloca.h/malloc.h fallback and replace _Complex types with cuComplex types in extern C signatures. nvcc on Windows uses MSVC cl.exe as host compiler which does not support _Complex.

Initialize module-level integer variables to 0 so ifort emits BSS/DATA symbols instead of COMMON. COMMON symbols are not visible to the Windows linker for DLL export.

Use direct function-call assignment instead of module-variable intermediaries in set_gpu_parameters. Skip elpa_ccl_gpu module writes on Windows (NCCL is Linux-only) to avoid cross-DLL data symbol access that lld-link cannot resolve.

The C++ compilation path needs Complex_I defined as std::complex<EV_TYPE>(0.0,1.0) before use. Without it the test fails to compile as C++.

The test includes <unistd.h> and calls sleep() which are POSIX-only. On Windows, map sleep() to Win32 Sleep() via <windows.h>.
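One way to express such a mapping is a small wrapper, sketched here under the name portable_sleep (a hypothetical helper, not the macro used in the test). Win32 Sleep() takes milliseconds, while POSIX sleep() takes seconds.

```c
#ifdef _WIN32
#include <windows.h>

/* Win32 Sleep() takes milliseconds and returns nothing. */
unsigned portable_sleep(unsigned seconds)
{
    Sleep(seconds * 1000u);
    return 0u;
}
#else
#include <unistd.h>

/* POSIX sleep() takes seconds and returns the unslept remainder. */
unsigned portable_sleep(unsigned seconds)
{
    return sleep(seconds);
}
#endif
```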

The _WIN32 branch defined Complex_I as std::complex<EV_TYPE>(0.0,1.0) which is C++ syntax invalid in C mode. MSVC UCRT's _Complex_I is an opaque struct that doesn't support arithmetic operators, so use clang's __builtin_complex(0.0, 1.0) to construct a double _Complex imaginary unit.

Add src/helpers/posix_memalign_compat.c with a Windows-compatible implementation of posix_memalign and a paired elpa_aligned_free helper. All code is guarded with #ifdef _WIN32; Linux builds are unaffected.
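A shim of this shape could be written roughly as follows. This is a sketch, not the contents of posix_memalign_compat.c: the compat_ function names and is_aligned are hypothetical, and the Windows branch assumes the CRT functions _aligned_malloc and _aligned_free.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#ifdef _WIN32
#include <malloc.h>

/* Windows: memory from _aligned_malloc must be released with _aligned_free. */
int compat_posix_memalign(void **out, size_t alignment, size_t size)
{
    void *p = _aligned_malloc(size, alignment);
    if (p == NULL) {
        return ENOMEM;
    }
    *out = p;
    return 0;
}

void compat_aligned_free(void *p)
{
    _aligned_free(p);
}
#else
/* POSIX: forward to the native call; plain free() is the matching release. */
int compat_posix_memalign(void **out, size_t alignment, size_t size)
{
    return posix_memalign(out, alignment, size);
}

void compat_aligned_free(void *p)
{
    free(p);
}
#endif

/* Small helper for checking the alignment of a returned pointer. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

Pairing the allocator with a dedicated free function matters on Windows, where passing an _aligned_malloc pointer to free() is invalid.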

clang 21 introduced -Wdefault-const-init-var-unsafe, which fires on const VLAs without initializers. The arrays are only used for their sizeof — dropping const preserves the compile-time size computation and silences the diagnostic.

flang-new (all upstream versions 17–23, including latest dev snapshot) incorrectly dereferences type(c_ptr) arguments in bind(C) calls when an integer(c_intptr_t) interface for the same C function is declared first. This passes the pointer VALUE instead of its ADDRESS, causing SIGSEGV in the NVIDIA driver for all eigensolver GPU tests (533 of 800 failures). Route cuda_malloc{,_host}_cptr and hip_malloc{,_host}_cptr through the intptr variant (integer(c_intptr_t), unaffected) and transfer the result back to type(c_ptr). gfortran, ifx, and AOCC flang are not affected — the workaround is harmless for correct compilers. Upstream: llvm/llvm-project#192655

GCC warns about functions marked __attribute__((always_inline)) that may not be inlinable when the inline keyword is missing. Add it to silence the warning and satisfy the compiler on ARM/aarch64.
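The macro change can be sketched like this. FORCE_INLINE, add_one, and use_add_one are hypothetical names chosen for the example; the ELPA sources use their own __forceinline macro.

```c
/* Pair the always_inline attribute with the inline keyword so that GCC
   does not warn that the function might not be inlinable. */
#if defined(_MSC_VER) && !defined(__clang__)
#define FORCE_INLINE __forceinline
#else
#define FORCE_INLINE inline __attribute__((always_inline))
#endif

static FORCE_INLINE int add_one(int x)
{
    return x + 1;
}

/* Non-inline entry point that exercises the forced-inline helper. */
int use_add_one(int x)
{
    return add_one(x);
}
```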

OpenMP 5.1 deprecated !$omp master in favour of !$omp masked. Switch the source to the newer directive while keeping HAVE_OMP_MASKED guards so compilers without masked support can continue to use the old path. Detection of HAVE_OMP_MASKED is currently handled in the CMake build.

Use %zu instead of %d for size_t arguments in cusolver error/debug messages. Prevents -Wformat warnings and potentially truncated output on 64-bit platforms.
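A short sketch of the format-specifier rule, using a hypothetical helper format_size and an invented message text: size_t arguments are printed with %zu, which stays correct on platforms where size_t is wider than int.

```c
#include <stddef.h>
#include <stdio.h>

/* Formats a size_t value with %zu. Using %d here would be undefined
   behavior where size_t and int differ in width. */
int format_size(char *buf, size_t buflen, size_t n)
{
    return snprintf(buf, buflen, "workspace size: %zu bytes", n);
}
```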

Add (void) cast for unused negative_or_positive parameter in cuda_copy_skewsymmetric_first_half_q_{double,float}_FromC. The parameter is API-symmetric with the _second_half variants but unused because the minus-kernel always negates unconditionally.

Remove entries_in_sub_matrix, columns_in_sub_matrix, and number_of_entries which are computed but never read.

Add missing #undef before #define ROW_LENGTH in real_128bit_256bit_512bit_BLOCK_template.c. The macro was defined in the kernel-function-name section without a trailing #undef.

Add (char*) cast to string literal passed to elpa_timer.

Missing return statements in GPU vendor-agnostic fallback paths cause -Wreturn-type. printf format specifier uses %d for a size_t-derived index that is long on LP64.

These REAL(C_DATATYPE_KIND) externals shadow the double-precision intrinsic ddot when the template is instantiated for single precision, triggering -Wexternal-interface-mismatch on flang-new. The routines never call ddot — the declarations are unreferenced.

vfmsq_f32(a, b, c) computes a - b*c, but _SIMD_NFMA(a, b, c) must compute c - a*b (matching the non-FMA fallback: c - MUL(a,b)). The double-precision counterpart vfmsq_f64(c, b, a) was already correct; the single-precision variant had the first and third arguments swapped. This caused the NEON_ARCH64_BLOCK4 (and BLOCK6) kernels to produce completely wrong results when operating on single-precision matrices, giving orthogonality errors of order 1e2 instead of 1e-6. BLOCK2 was unaffected as its rank-2 update path does not call _SIMD_NFMA.
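The argument-order issue can be checked with a scalar model, sketched below. model_vfms mimics one lane of vfmsq_f32 as described above; the function names are hypothetical and no NEON intrinsics are used, so the sketch runs on any platform.

```c
/* Scalar model of one lane of vfmsq_f32(a, b, c): computes a - b*c. */
float model_vfms(float a, float b, float c)
{
    return a - b * c;
}

/* Required semantics of _SIMD_NFMA(a, b, c): c - a*b. */
float nfma_reference(float a, float b, float c)
{
    return c - a * b;
}

/* Buggy mapping: arguments passed through in the same order. */
float nfma_buggy(float a, float b, float c)
{
    return model_vfms(a, b, c);
}

/* Fixed mapping: first and third arguments swapped. */
float nfma_fixed(float a, float b, float c)
{
    return model_vfms(c, b, a);
}
```

With a=2, b=3, c=10 the reference and fixed forms give 4, while the buggy form gives -28.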

…liaries

Apple Accelerate vecLib's single-precision LAPACK auxiliary routines SLAMCH and SLAPY2 are non-functional on macOS Sequoia (confirmed 15.5, Apple M4): SLAMCH returns 0.0 for every query, and SLAPY2 returns 0.0 or a large garbage constant for all inputs. The double-precision counterparts DLAMCH and DLAPY2 are unaffected.

These two broken routines were the root cause of all 116 single-precision test failures when ELPA is linked against Accelerate:

- SLAMCH('E') = 0.0 collapses the D&C deflation tolerance TOL = 8*eps*max(dmax,zmax) to zero in merge_systems, causing wrong deflation decisions and therefore incorrect eigenvalues.
- SLAPY2(c,s) = 0.0 (or garbage) produces a zero Givens rotation norm in merge_systems, leading to division by zero / NaN in the deflation step; the same bug corrupts Householder reflector norms in the two-stage QR bandwidth-reduction (elpa_pdgeqrf_template).

Replace both calls with portable Fortran intrinsics, epsilon() and hypot(), which are correct on all platforms and are what modern Reference-LAPACK uses internally anyway. With these fixes all 424 tests pass at -O3 against Accelerate.
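The effect of a zero machine epsilon on the tolerance formula TOL = 8*eps*max(dmax,zmax) can be sketched in C. deflation_tol is a hypothetical helper, and FLT_EPSILON is used as the C analogue of Fortran's single-precision epsilon(); the exact scaling used inside ELPA is not shown here.

```c
#include <float.h>

/* Deflation tolerance as given in the description:
   TOL = 8 * eps * max(dmax, zmax). */
float deflation_tol(float eps, float dmax, float zmax)
{
    float largest = dmax > zmax ? dmax : zmax;
    return 8.0f * eps * largest;
}
```

Passing eps = 0.0f, as the broken SLAMCH effectively did, yields a tolerance of exactly zero regardless of dmax and zmax, whereas FLT_EPSILON gives a small positive threshold.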

Bare definitions of gpuMemcpyHostToDevice and gpuMemcpyDeviceToHost in layerVariables.h relied on ELF tentative-definition merging to silently deduplicate the symbols when multiple translation units are linked. GCC made -fno-common the default in GCC 10, which turns such collisions into link errors. COFF (Windows) has never supported tentative definitions and rejects them as hard duplicates. Move the authoritative definitions to layerFunctions.c unconditionally, and replace the bare definitions in both header files with extern declarations. Also move the extern int declarations in layerFunctions.h inside the already-present extern "C" block so that C++ translation units (e.g. test.cpp, which includes test.c) resolve to the C-linkage symbols in elpatest.lib rather than emitting C++-mangled references that can never be satisfied.
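The declaration/definition split can be sketched in a single listing. The two symbol names come from the description above; the initial values and the single-file layout are assumptions made so the sketch compiles on its own.

```c
/* --- would live in the shared header --- */
#ifdef __cplusplus
extern "C" {
#endif

/* Declarations only: no storage is allocated in including files. */
extern int gpuMemcpyHostToDevice;
extern int gpuMemcpyDeviceToHost;

#ifdef __cplusplus
}
#endif

/* --- would live in exactly one .c file --- */
/* The single authoritative definitions (values are placeholders). */
int gpuMemcpyHostToDevice = 1;
int gpuMemcpyDeviceToHost = 2;
```

With this layout every translation unit that includes the header refers to the same two objects, and the linker sees exactly one definition of each under both ELF and COFF.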

Introduce a CMake-based build path without modifying the autotools files. Validated across x86_64 Linux, AArch64 Linux, macOS Apple Silicon, and Windows with multiple compiler, MPI, and math-library combinations — including CUDA where available. See cmake_build_examples/README.md for the full validation matrix, design decisions, and usage instructions.

NVTX v3 is header-only — the old libnvToolsExt.so no longer exists. The CMake detection searched only for the legacy library, which caused a warning when ELPA_NVTX=ON with a modern NVTX3 installation. Detection now prefers NVTX3 headers (nvtx3/nvToolsExt.h) and falls back to the legacy library. A small C shim (src/general/nvtx_impl.c) compiles with NVTX_EXPORT_API to emit the externally-visible nvtxRangePushA / nvtxRangePop symbols that Fortran bind(C) interfaces require at link time.

Fortran complex(kind=8) stack variables are only 8-byte aligned, but the compiler generates movapd (requires 16-byte alignment) when dereferencing the host pointer as double2. Use memcpy instead of direct dereference in gpu_set_e_vec_scale_set_one_store_v_row and gpu_store_u_v_in_uv_vu. Affects 1-stage solver with >1 MPI rank per GPU (useCCL=false path).
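The memcpy technique can be sketched in plain C. double_pair is a hypothetical stand-in for CUDA's double2 (which additionally carries a 16-byte alignment requirement), and load_pair is an illustrative helper rather than the code in the patch.

```c
#include <string.h>

/* Stand-in for a two-double vector type such as double2. */
typedef struct {
    double x;
    double y;
} double_pair;

/* Copies 16 bytes from src without assuming 16-byte alignment.
   A direct dereference through a cast pointer would let the compiler
   emit an aligned load and fault on an 8-byte-aligned address. */
double_pair load_pair(const void *src)
{
    double_pair value;
    memcpy(&value, src, sizeof value);
    return value;
}
```

Compilers typically lower such a fixed-size memcpy to an unaligned load, so the safer form usually costs little.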

…tion

The explicit-name C API (elpa_eigenvectors_double, etc.) calls is_device_ptr() at runtime to auto-detect host vs device pointers, unlike the generic C API macro, which resolves to _a_h_a_* at compile time and never touches CUDA. The ROCm backend (elpa_explicit_name_amd_gpu.hip) already handles hipPointerGetAttributes failure gracefully: it prints a warning, clears the error, and returns 0 (host pointer). The CUDA backend instead called exit(1), crashing the process when the driver is missing or too old. This patch makes the CUDA path behave identically to the ROCm path: warn, clear the sticky error with cudaGetLastError(), and return 0.

SoShiny · 2026-04-28T22:40:45Z

        beta = sign(dlapy2(alpha, xnorm), alpha)
 #else
-        beta = sign(slapy2(alpha, xnorm), alpha)
+        ! NOTE: slapy2 is broken in Apple Accelerate vecLib for single precision.


It turns out that Apple's Accelerate library ships both an old CLAPACK and a new LAPACK implementation, and the functions work if the Fortran interfaces are bound to the new LAPACK. The new symbols have $NEWLAPACK and $NEWLAPACK$ILP64 suffixes.

    function slapy2(x, y) bind(C, name="slapy2$NEWLAPACK") result(r)
      import :: c_ptr, c_float
      type(c_ptr), value :: x, y
      real(c_float) :: r
    end function

…a_h_a

qDev is declared but never allocated in this function (eigenvalues-only path). The gpu_free was incorrectly copied from elpa_generalized_eigenvectors_a_h_a where qDev is properly allocated. This caused cudaFree: invalid argument on GPU. Introduced in 017e3fc.

SoShiny added 30 commits April 20, 2026 11:11

Fix missing return when Cholesky decomposition fails

ef8114e

Caller sees stale error code when the Cholesky decomposition fails because the function falls through without returning the failure status. Set a unique error code (141414) and return immediately.

Fix elpa_index sscanf stack corruption and missing returns

47387aa

sscanf "%lf" reads a double into a float* pointer, corrupting the stack. Changed to "%f". Also added missing return values in _enumerate switch functions and a portable _lfind wrapper for Windows.

Fix uninitialized my_stream in solve_tridi templates

23d1682

Initialize my_stream to 0 before the WITH_GPU_STREAMS conditional in solve_tridi_col, solve_tridi_single_problem, and merge_systems templates. Prevents run-time uninitialized-variable errors when GPU streams are disabled.

Fix uninitialized return variables in cusolver/GPU templates

9f02123

Initialize version=0 and success=.false. in GPU vendor-agnostic and cusolver templates before preprocessor-guarded assignments. Suppresses ifort #6178 on non-GPU builds.

Fix uninitialized rocsolver_get_version result

0a3db4e

Initialize the version result to 0 before the WITH_AMD_ROCSOLVER guard so the variable is always defined on non-ROCm builds.

Add Windows portability guards for POSIX-only APIs

3b8a74a

Add conditional includes for POSIX-only headers, complex-type wrappers, and Windows-specific guards around the aligned allocation interfaces. All changes are guarded with #ifdef _WIN32 and leave Linux behavior unchanged.

Add clang-cl compatible complex type handling in SIMD kernels

96ca507

Add platform-aware complex type macros for the SIMD kernel template so clang-cl (which supports C99 _Complex unlike MSVC cl.exe) can compile the complex BLOCK kernels on Windows.

Use portable complex type macros in _Generic selections

962b17c

MSVC headers redefine 'complex' as a macro via corecrt_math.h. Use double_complex/float_complex platform macros in _Generic selections to avoid conflicts.

Add Windows portability for test sources

a7767b8

Add alloca.h fallback for C and CUDA tests, use C99 _Complex types directly for ccache+clang-cl compatibility.

Add Windows portability for CUDA sources

7c5e021

Add alloca.h/malloc.h fallback and replace _Complex types with cuComplex types in extern C signatures. nvcc on Windows uses MSVC cl.exe as host compiler which does not support _Complex.

Initialize Fortran module variables for DLL symbol export

164fb58

Initialize module-level integer variables to 0 so ifort emits BSS/DATA symbols instead of COMMON. COMMON symbols are not visible to the Windows linker for DLL export.

Add Complex_I define for C++ skewsymmetric test

9a514a3

The C++ compilation path needs Complex_I defined as std::complex<EV_TYPE>(0.0,1.0) before use. Without it the test fails to compile as C++.

Add Windows portability for test_multiple_objs.c

707d04b

The test includes <unistd.h> and calls sleep() which are POSIX-only. On Windows, map sleep() to Win32 Sleep() via <windows.h>.

Add posix_memalign shim for Windows CMake builds

5ea61e2

Add src/helpers/posix_memalign_compat.c with a Windows-compatible implementation of posix_memalign and a paired elpa_aligned_free helper. All code is guarded with #ifdef _WIN32; Linux builds are unaffected.

Add inline keyword to __forceinline macro

1bfbf79

GCC warns about functions marked __attribute__((always_inline)) that may not be inlinable when the inline keyword is missing. Add it to silence the warning and satisfy the compiler on ARM/aarch64.

Fix printf format specifiers in cusolver template

1e70733

Use %zu instead of %d for size_t arguments in cusolver error/debug messages. Prevents -Wformat warnings and potentially truncated output on 64-bit platforms.

Remove dead variables in distribute_global_column_kernel

53e74e6

Remove entries_in_sub_matrix, columns_in_sub_matrix, and number_of_entries which are computed but never read.

SoShiny added 11 commits April 20, 2026 17:57

Fix ROW_LENGTH macro redefinition warning

2eec971

Add missing #undef before #define ROW_LENGTH in real_128bit_256bit_512bit_BLOCK_template.c. The macro was defined in the kernel-function-name section without a trailing #undef.

Fix string literal to non-const char* warning in test

8a700d5

Add (char*) cast to string literal passed to elpa_timer.

Fix warnings in test C sources

f31f6e0

Missing return statements in GPU vendor-agnostic fallback paths cause -Wreturn-type. printf format specifier uses %d for a size_t-derived index that is long on LP64.

SoShiny wants to merge 42 commits into marekandreas:master from SoShiny:add_cmake_build_system
