Add CMake build system overlay with cross-platform support and source fixes #71

Open
SoShiny wants to merge 42 commits into marekandreas:master from SoShiny:add_cmake_build_system

@SoShiny SoShiny commented Apr 22, 2026

Motivation

This work originates from Synopsys QuantumATK's need to build ELPA on Windows,
aligning it with the existing Linux version. The CMake build system is added as a
supplemental overlay alongside the existing autotools setup; with some relatively
minor extra work it could serve as a replacement.

Advantages over autotools:

  • Native Windows support without MSYS2/Cygwin.
  • Faster configure/generate cycles on large projects.
  • IDE integration (Visual Studio, CLion, VS Code) out of the box.
  • Easier cross-compilation through CMake toolchain files.

While developing the CMake overlay, several in-source issues were discovered and
fixed — ranging from hard bugs (crashes, wrong results) to build/runtime warning
cleanups. ARM (AArch64) and macOS support was explored as well, driven by the
recurring demand to run on Apple Silicon.

What's included

CMake build system (new, additive)

  • Top-level CMakeLists.txt + modular cmake/ modules covering compiler
    options, feature checks, kernel selection, Fortran preprocessing, generated
    files, CUDA dependencies, MPI, OpenMP, math libraries, selective symbol
    export, install, testing, and packaging.
  • Python helpers for Fortran preprocessing, interface extraction, export DEF
    generation, test generation, and constant processing.
  • Ready-made build configuration scripts for Linux (GCC, Clang/Flang, AOCC,
    Intel ICX/IFX — with MKL, OpenBLAS, BLIS; OpenMPI and Intel MPI; x86‑64 +
    AArch64), Windows (Clang-CL + ifort/ifx), and macOS ARM64 (GCC +
    Accelerate/OpenBLAS).
  • Prerequisite checker, build driver, test runner, and validation harness —
    primarily for documentation and reproducibility rather than CI-grade automation.

Bug fixes (correctness)

  • Fix CUDA streams crash: my_stream must be firstprivate in OpenMP regions.
  • Fix test failures in Release builds caused by assert side effects.
  • Fix elpa_index sscanf stack corruption and missing returns.
  • Fix uninitialized variables in cusolver, GPU vendor-agnostic, and solve_tridi
    templates.
  • Fix GPU skew-symmetric bandred column offset.
  • Fix missing return when Cholesky decomposition fails.
  • Fix undefined _XOR_EPI in AVX-512 Xeon Phi complex kernel.
  • Fix NEON AArch64 single-precision _SIMD_NFMA argument order.
  • Fix single-precision failures from broken Accelerate vecLib auxiliaries.
  • Fix gpuMemcpy* common-symbol ODR violations in test layer.

Warning and portability fixes

  • Replace deprecated omp master with omp masked.
  • Fix printf format specifiers, string literal const-correctness, macro
    redefinitions, dead variables, unused parameters, and VLA const qualifiers.
  • Add inline keyword to __forceinline macro for standard compliance.
  • Work around flang-new type(c_ptr) bind(C) codegen bug.
  • Add Windows portability: posix_memalign shim, Complex_I macro, portable
    complex type macros, _Generic selections, POSIX-only API guards, DLL symbol
    export initialization, and GPU/test/CUDA source adjustments.
  • Cap test_multiple_objs autotuning loop at 20 SCF steps.

What's missing

The following autotools features are not yet implemented in the CMake overlay:

  • ROCm (HIP) and SYCL GPU back-ends — only CUDA has been tried.
  • Doxygen documentation generation (make doc).
  • make dist / tarball packaging.
  • Full feature parity for all configure flags (e.g. some niche kernel
    selections or experimental options may be missing).

Scope

134 files changed, ~10 300 insertions, ~300 deletions.
About 9 000 of the inserted lines are the CMake overlay and build examples;
the remaining ~1 300 touch existing sources (fixes + portability guards).

SoShiny added 30 commits April 20, 2026 11:11
Replace undefined _XOR_EPI macro call with _SIMD_XOR_EPI in the
AVX-512 Xeon Phi path of the complex BLOCK kernel template.
Only _SIMD_XOR_EPI is defined. The code compiles when the ifdef
is inactive but would fail on Knights Landing processors.
Caller sees stale error code when the Cholesky decomposition fails
because the function falls through without returning the failure status.
Set a unique error code (141414) and return immediately.
last_stripe_width was computed inside if(useGPU) but used
unconditionally on the CPU path. On Linux the stack happened to be
zeroed, masking the bug. On Windows the uninitialized value caused
wrong stripe decomposition and 2stage eigenvector corruption.
In tridiag_template.F90, my_stream was declared private in the OpenMP
parallel directive, leaving each thread with an uninitialized stream
handle. Changed to firstprivate so every thread copies the initialized
value, preventing access violations in the CUDA driver.
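
The difference between private and firstprivate can be sketched in C. This is a minimal illustration rather than the ELPA code: stream_t and sum_stream_ids are hypothetical names, and the pragmas are simply ignored when the file is compiled without OpenMP.

```c
/* Hypothetical stand-in for a GPU stream handle; not an ELPA or CUDA type. */
typedef struct { int id; } stream_t;

/* Sums the handle id seen by each of n loop iterations.
 * With firstprivate every thread starts from the caller's initialized
 * handle; with private(s) each thread's copy would be indeterminate. */
long sum_stream_ids(stream_t s, int n)
{
    long total = 0;
    #pragma omp parallel firstprivate(s) reduction(+:total)
    {
        #pragma omp for
        for (int i = 0; i < n; ++i)
            total += s.id;   /* reads the copied, initialized value */
    }
    return total;
}
```

Calling sum_stream_ids((stream_t){7}, 10) yields 70 for any thread count, because every thread works on a copy of the initialized handle.
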
elpa_setup() and elpa_setup_gpu() were called inside assert_elpa_ok()
which expands to assert(). With NDEBUG the calls are silently removed,
leaving the ELPA handle uninitialized and causing crashes. Move the
calls outside the assert.
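
A small C sketch of why a call placed inside assert() disappears in Release builds. fake_setup and the two wrapper functions are hypothetical and only mimic the pattern described above; NDEBUG is defined by hand here to imitate a Release configuration.

```c
#define NDEBUG            /* what a Release build typically defines */
#include <assert.h>

static int setup_calls = 0;

/* Hypothetical stand-in for a setup routine; returns 0 on success. */
static int fake_setup(void) { ++setup_calls; return 0; }

/* Broken pattern: the call lives inside assert() and vanishes under NDEBUG. */
int setup_inside_assert(void)
{
    setup_calls = 0;
    assert(fake_setup() == 0);
    return setup_calls;          /* 0 calls while NDEBUG is defined */
}

/* Fixed pattern: call first, then assert on the stored result. */
int setup_outside_assert(void)
{
    setup_calls = 0;
    int rc = fake_setup();
    assert(rc == 0);
    (void)rc;                    /* avoids an unused-variable warning */
    return setup_calls;          /* always 1 */
}

/* Re-enable assert for any code that follows this sketch. */
#undef NDEBUG
#include <assert.h>
```

setup_inside_assert() reports zero calls and setup_outside_assert() reports one, which is the difference between an uninitialized and an initialized handle.
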
sscanf "%lf" reads a double into a float* pointer, corrupting the
stack. Changed to "%f". Also added missing return values in
_enumerate switch functions and a portable _lfind wrapper for Windows.
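
For reference, a minimal C sketch of the matching conversion. parse_float is a hypothetical helper, not a function from elpa_index.

```c
#include <stdio.h>

/* Parses a float from text; returns 1 on success, 0 otherwise.
 * "%f" matches a float* argument, whereas "%lf" would make sscanf
 * write sizeof(double) bytes through the pointer. */
int parse_float(const char *text, float *out)
{
    return sscanf(text, "%f", out) == 1;
}
```
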
Initialize my_stream to 0 before the WITH_GPU_STREAMS conditional in
solve_tridi_col, solve_tridi_single_problem, and merge_systems templates.
Prevents run-time uninitialized-variable errors when GPU streams are
disabled.
Initialize return variables (flag=-1, success=.false., version=0)
before preprocessor-guarded assignments in GPU vendor-agnostic CCL
and BLAS layer templates. Prevents compiler warnings and undefined
behavior when compiled without a GPU backend.
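
The same idea expressed in C, as a sketch only: WITH_FAKE_GPU and query_gpu_version are hypothetical names, and the real templates are Fortran.

```c
/* WITH_FAKE_GPU is a hypothetical feature macro used only in this sketch. */
int query_gpu_version(void)
{
    int version = 0;              /* defined even when no backend is compiled in */
#ifdef WITH_FAKE_GPU
    version = 12;                 /* placeholder for a real backend query */
#endif
    return version;
}
```

Without the initializer, a build that leaves WITH_FAKE_GPU undefined would return an indeterminate value.
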
Initialize version=0 and success=.false. in GPU vendor-agnostic
and cusolver templates before preprocessor-guarded assignments.
Suppresses ifort #6178 on non-GPU builds.
Initialize the version result to 0 before the WITH_AMD_ROCSOLVER
guard so the variable is always defined on non-ROCm builds.
The umc_dev pointer arithmetic in the GPU skewsymmetric path of
bandred used (n_cols+1+1) instead of (n_cols+1-1) for the Fortran-to-C
0-based index conversion. This read 2 columns past the intended
position, causing wrong eigenvalues for small matrices and CUDA memory
errors for large matrices. The CPU path and GPU symmetric path were
correct.
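
The index arithmetic can be illustrated with a small C helper. It assumes a column-major array with leading dimension ld and is not the actual bandred expression; column_offset is a hypothetical name.

```c
#include <stddef.h>

/* Offset of the first element of 1-based column col_one_based in a
 * column-major array with leading dimension ld. */
size_t column_offset(size_t col_one_based, size_t ld)
{
    return (col_one_based - 1) * ld;   /* 1-based column to 0-based offset */
}
```

For n_cols = 3 and ld = 5, column n_cols+1 starts at offset (3+1-1)*5 = 15. Using (3+1+1)*5 = 25 instead lands two columns (2*ld elements) further on.
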
The Fortran variant used an unbounded do while(autotune_step()) loop
that exhausts the full ELPA_AUTOTUNE_FAST search space.  With CUDA
enabled the search space contains hundreds of parameter combinations,
and each iteration also executes two call sleep(2) statements, so the
test runs for more than 600 seconds.

The C and C++ variants already cap the loop at 20 SCF steps.  Apply
the same limit to the Fortran variant and remove the sleep() calls.
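
A sketch of the capped loop in C. The search space is modelled as a plain counter instead of a real autotune call, and run_capped is a hypothetical name.

```c
/* Runs at most max_steps iterations, or fewer if the (simulated)
 * search space is exhausted first. Returns the number of steps taken. */
int run_capped(int search_space, int max_steps)
{
    int steps = 0;
    /* The cap is checked first, so no combination is consumed after it. */
    while (steps < max_steps && search_space-- > 0)
        ++steps;
    return steps;
}
```

With 300 combinations and a cap of 20 the loop stops after 20 steps. With only 5 combinations it stops after 5.
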
Add conditional includes for POSIX-only headers, complex-type wrappers,
and Windows-specific guards around the aligned allocation interfaces.
All changes are guarded with #ifdef _WIN32 and leave Linux behavior
unchanged.
Add platform-aware complex type macros for the SIMD kernel template
so clang-cl (which supports C99 _Complex unlike MSVC cl.exe) can
compile the complex BLOCK kernels on Windows.
MSVC headers redefine 'complex' as a macro via corecrt_math.h.
Use double_complex/float_complex platform macros in _Generic
selections to avoid conflicts.
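
A minimal C11 sketch of a _Generic selection that never spells the bare word complex. The macro definitions shown are assumptions for a GCC/Clang-style compiler; the PR's actual platform macros may be defined differently.

```c
/* Hypothetical platform macros; on another toolchain they could map
 * to different underlying types. */
#define double_complex double _Complex
#define float_complex  float _Complex

/* Returns a BLAS-style precision tag for the argument's type. */
#define PRECISION_TAG(x) _Generic((x), \
    double:         'd',               \
    float:          's',               \
    double_complex: 'z',               \
    float_complex:  'c')
```

Because the association list uses the macros, a header that redefines complex cannot change which branch is selected.
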
Add alloca.h fallback for C and CUDA tests, use C99 _Complex types
directly for ccache+clang-cl compatibility.
Add alloca.h/malloc.h fallback and replace _Complex types with
cuComplex types in extern C signatures. nvcc on Windows uses MSVC
cl.exe as host compiler which does not support _Complex.
Initialize module-level integer variables to 0 so ifort emits
BSS/DATA symbols instead of COMMON. COMMON symbols are not
visible to the Windows linker for DLL export.
Use direct function-call assignment instead of module-variable
intermediaries in set_gpu_parameters. Skip elpa_ccl_gpu module
writes on Windows (NCCL is Linux-only) to avoid cross-DLL data
symbol access that lld-link cannot resolve.
The C++ compilation path needs Complex_I defined as
std::complex<EV_TYPE>(0.0,1.0) before use. Without it the test
fails to compile as C++.
The test includes <unistd.h> and calls sleep() which are POSIX-only.
On Windows, map sleep() to Win32 Sleep() via <windows.h>.
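
One way such a mapping can look, as a sketch: portable_sleep and seconds_to_ms are hypothetical names. The unit conversion is needed because Win32 Sleep() takes milliseconds while POSIX sleep() takes seconds.

```c
#ifdef _WIN32
#include <windows.h>
#else
#include <unistd.h>
#endif

/* Converts whole seconds to milliseconds for the Win32 API. */
unsigned long seconds_to_ms(unsigned seconds) { return seconds * 1000UL; }

/* Sleeps for the given number of seconds on either platform. */
void portable_sleep(unsigned seconds)
{
#ifdef _WIN32
    Sleep((DWORD)seconds_to_ms(seconds));
#else
    sleep(seconds);
#endif
}
```
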
The _WIN32 branch defined Complex_I as std::complex<EV_TYPE>(0.0,1.0),
which is C++ syntax and therefore invalid in C mode. MSVC UCRT's
_Complex_I is an opaque struct that does not support arithmetic
operators, so use clang's __builtin_complex(0.0, 1.0) to construct a
double _Complex imaginary unit instead.
Add src/helpers/posix_memalign_compat.c with a Windows-compatible
implementation of posix_memalign and a paired elpa_aligned_free helper.
All code is guarded with #ifdef _WIN32; Linux builds are unaffected.
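
A rough sketch of how such a shim can be built on the Windows CRT. compat_posix_memalign and compat_aligned_free are hypothetical names, and the code is not the contents of posix_memalign_compat.c. A paired free helper is needed because memory from _aligned_malloc must be released with _aligned_free.

```c
#ifndef _WIN32
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* expose posix_memalign under strict -std=c11 */
#endif
#endif
#include <stdlib.h>
#include <stdint.h>               /* uintptr_t, handy for alignment checks */
#ifdef _WIN32
#include <errno.h>
#include <malloc.h>
#endif

/* Same contract as POSIX posix_memalign: 0 on success, error code otherwise. */
int compat_posix_memalign(void **out, size_t alignment, size_t size)
{
#ifdef _WIN32
    void *p = _aligned_malloc(size, alignment);   /* note the argument order */
    if (p == NULL)
        return ENOMEM;
    *out = p;
    return 0;
#else
    return posix_memalign(out, alignment, size);
#endif
}

/* Releases memory obtained from compat_posix_memalign. */
void compat_aligned_free(void *p)
{
#ifdef _WIN32
    _aligned_free(p);   /* must not be passed to free() */
#else
    free(p);
#endif
}
```
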
clang 21 introduced -Wdefault-const-init-var-unsafe, which fires on
const VLAs without initializers. The arrays are only used for their
sizeof — dropping const preserves the compile-time size computation
and silences the diagnostic.
flang-new (all upstream versions 17–23, including latest dev snapshot)
incorrectly dereferences type(c_ptr) arguments in bind(C) calls when
an integer(c_intptr_t) interface for the same C function is declared
first. This passes the pointer VALUE instead of its ADDRESS, causing
SIGSEGV in the NVIDIA driver for all eigensolver GPU tests (533 of
800 failures).

Route cuda_malloc{,_host}_cptr and hip_malloc{,_host}_cptr through
the intptr variant (integer(c_intptr_t), unaffected) and transfer
the result back to type(c_ptr).

gfortran, ifx, and AOCC flang are not affected — the workaround is
harmless for correct compilers.

Upstream: llvm/llvm-project#192655
GCC warns about functions marked __attribute__((always_inline)) that
may not be inlinable when the inline keyword is missing. Add it to
silence the warning and satisfy the compiler on ARM/aarch64.
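
A sketch of a force-inline macro that pairs the attribute with the inline keyword. FORCE_INLINE and the two functions are hypothetical; the PR changes ELPA's own __forceinline macro, which is not reproduced here.

```c
/* Pair always_inline with the inline keyword on GCC-compatible compilers. */
#if defined(__GNUC__)
#define FORCE_INLINE static inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#define FORCE_INLINE static __forceinline
#else
#define FORCE_INLINE static inline
#endif

/* Small kernel-style helper that should always be inlined. */
FORCE_INLINE double mul_add(double a, double b, double c) { return a * b + c; }

double use_inline(double x) { return mul_add(x, 2.0, 1.0); }
```
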
OpenMP 5.1 deprecated !$omp master in favour of !$omp masked.
Switch the source to the newer directive while keeping HAVE_OMP_MASKED
guards so compilers without masked support can continue to use the old
path.

Detection of HAVE_OMP_MASKED is currently handled in the CMake build.
Use %zu instead of %d for size_t arguments in cusolver error/debug
messages. Prevents -Wformat warnings and potentially truncated output
on 64-bit platforms.
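
A minimal C sketch of the corrected format specifier. format_size and the message text are made up for illustration.

```c
#include <stdio.h>
#include <stddef.h>

/* Formats a size_t with %zu; returns the number of characters that
 * snprintf would have written (excluding the terminating NUL). */
int format_size(char *buf, size_t buflen, size_t value)
{
    return snprintf(buf, buflen, "workspace bytes: %zu", value);
}
```
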
Add (void) cast for unused negative_or_positive parameter in
cuda_copy_skewsymmetric_first_half_q_{double,float}_FromC.
The parameter is API-symmetric with the _second_half variants
but unused because the minus-kernel always negates unconditionally.
Remove entries_in_sub_matrix, columns_in_sub_matrix, and
number_of_entries which are computed but never read.
SoShiny added 11 commits April 20, 2026 17:57
Add missing #undef before #define ROW_LENGTH in
real_128bit_256bit_512bit_BLOCK_template.c. The macro was defined
in the kernel-function-name section without a trailing #undef.
Add (char*) cast to string literal passed to elpa_timer.
Missing return statements in GPU vendor-agnostic fallback paths
cause -Wreturn-type. printf format specifier uses %d for a
size_t-derived index that is long on LP64.
These REAL(C_DATATYPE_KIND) externals shadow the double-precision
intrinsic ddot when the template is instantiated for single
precision, triggering -Wexternal-interface-mismatch on flang-new.
The routines never call ddot — the declarations are unreferenced.
vfmsq_f32(a, b, c) computes a - b*c, but _SIMD_NFMA(a, b, c) must
compute c - a*b (matching the non-FMA fallback: c - MUL(a,b)).
The double-precision counterpart vfmsq_f64(c, b, a) was already
correct; the single-precision variant had the first and third
arguments swapped.

This caused the NEON_ARCH64_BLOCK4 (and BLOCK6) kernels to produce
completely wrong results when operating on single-precision matrices,
giving orthogonality errors of order 1e2 instead of 1e-6.  BLOCK2
was unaffected as its rank-2 update path does not call _SIMD_NFMA.
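
The argument order can be checked with a scalar C model. model_vfms mirrors the a - b*c semantics of vfmsq_f32 stated above; the function names are hypothetical and no NEON hardware is required.

```c
/* Scalar model of vfmsq_f32(a, b, c): computes a - b*c per lane. */
static float model_vfms(float a, float b, float c) { return a - b * c; }

/* _SIMD_NFMA(a, b, c) must yield c - a*b, so the accumulator c is
 * passed as the first argument of the intrinsic. */
float nfma_correct(float a, float b, float c) { return model_vfms(c, b, a); }

/* With the first and third arguments swapped the result is a - b*c. */
float nfma_swapped(float a, float b, float c) { return model_vfms(a, b, c); }
```

For a = 2, b = 3, c = 10 the correct form gives 10 - 6 = 4, while the swapped form gives 2 - 30 = -28.
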
…liaries

Apple Accelerate vecLib's single-precision LAPACK auxiliary routines
SLAMCH and SLAPY2 are non-functional on macOS Sequoia (confirmed 15.5,
Apple M4): SLAMCH returns 0.0 for every query, and SLAPY2 returns 0.0
or a large garbage constant for all inputs.  The double-precision
counterparts DLAMCH and DLAPY2 are unaffected.

These two broken routines were the root cause of all 116 single-precision
test failures when ELPA is linked against Accelerate:

- SLAMCH('E') = 0.0 collapses the D&C deflation tolerance
  TOL = 8*eps*max(dmax,zmax) to zero in merge_systems, causing wrong
  deflation decisions and therefore incorrect eigenvalues.

- SLAPY2(c,s) = 0.0 (or garbage) produces a zero Givens rotation norm
  in merge_systems, leading to division by zero / NaN in the deflation
  step; the same bug corrupts Householder reflector norms in the
  two-stage QR bandwidth-reduction (elpa_pdgeqrf_template).

Replace both calls with portable Fortran intrinsics — epsilon() and
hypot() — which are correct on all platforms and are what modern
Reference-LAPACK uses internally anyway.  With these fixes all 424
tests pass at -O3 against Accelerate.
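
The tolerance collapse is easy to reproduce in a few lines of C. This is an analogy only: the actual fix is in Fortran, and FLT_EPSILON plays the role of epsilon() here. deflation_tol is a hypothetical name.

```c
#include <float.h>

/* TOL = 8 * eps * max(dmax, zmax), with eps supplied by the caller. */
float deflation_tol(float eps, float dmax, float zmax)
{
    float m = dmax > zmax ? dmax : zmax;
    return 8.0f * eps * m;
}
```

An eps of 0.0, as returned by the broken query, makes the tolerance exactly zero. Passing FLT_EPSILON keeps it strictly positive.
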
Bare definitions of gpuMemcpyHostToDevice and gpuMemcpyDeviceToHost in
layerVariables.h relied on ELF tentative-definition merging to silently
deduplicate the symbols when multiple translation units are linked.
GCC made -fno-common the default in GCC 10, which turns such collisions
into link errors.  COFF (Windows) has never supported tentative
definitions and rejects them as hard duplicates.

Move the authoritative definitions to layerFunctions.c
unconditionally, and replace the bare definitions in both header files
with extern declarations.  Also move the extern int declarations in
layerFunctions.h inside the already-present extern "C" block so that
C++ translation units (e.g. test.cpp, which includes test.c) resolve
to the C-linkage symbols in elpatest.lib rather than emitting
C++-mangled references that can never be satisfied.
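
The header/source split can be sketched in a single C listing. The demo* names and values are placeholders, not the real gpuMemcpy* symbols.

```c
/* --- would live in a shared header --- */
#ifdef __cplusplus
extern "C" {
#endif
extern int demoMemcpyHostToDevice;   /* declaration only, reserves no storage */
extern int demoMemcpyDeviceToHost;
#ifdef __cplusplus
}
#endif

/* --- would live in exactly one .c file --- */
int demoMemcpyHostToDevice = 1;      /* placeholder values */
int demoMemcpyDeviceToHost = 2;

/* Any translation unit that includes the header can use the symbols. */
int demo_direction_sum(void)
{
    return demoMemcpyHostToDevice + demoMemcpyDeviceToHost;
}
```

With one definition and extern declarations everywhere else, the link no longer depends on tentative-definition merging.
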
Introduce a CMake-based build path without modifying the
autotools files. Validated across x86_64 Linux, AArch64 Linux,
macOS Apple Silicon, and Windows with multiple compiler, MPI, and
math-library combinations — including CUDA where available.

See cmake_build_examples/README.md for the full validation matrix,
design decisions, and usage instructions.
NVTX v3 is header-only — the old libnvToolsExt.so no longer exists.
The CMake detection searched only for the legacy library, which caused
a warning when ELPA_NVTX=ON with a modern NVTX3 installation.

Detection now prefers NVTX3 headers (nvtx3/nvToolsExt.h) and falls
back to the legacy library. A small C shim (src/general/nvtx_impl.c)
compiles with NVTX_EXPORT_API to emit the externally-visible
nvtxRangePushA / nvtxRangePop symbols that Fortran bind(C) interfaces
require at link time.
Fortran complex(kind=8) stack variables are only 8-byte aligned, but the
compiler generates movapd (requires 16-byte alignment) when dereferencing
the host pointer as double2. Use memcpy instead of direct dereference in
gpu_set_e_vec_scale_set_one_store_v_row and gpu_store_u_v_in_uv_vu.

Affects 1-stage solver with >1 MPI rank per GPU (useCCL=false path).
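
A host-side C sketch of the memcpy approach. pair_t stands in for a 16-byte double2 and load_pair is a hypothetical helper; the real change is in the GPU kernel wrappers.

```c
#include <string.h>

typedef struct { double x, y; } pair_t;  /* stand-in for a 16-byte double2 */

/* Copies a (re, im) pair from a host address that is only guaranteed
 * to be 8-byte aligned; memcpy makes no alignment assumption. */
pair_t load_pair(const void *host_ptr)
{
    pair_t v;
    memcpy(&v, host_ptr, sizeof v);
    return v;
}
```

Reading from &buf[1] of a double array works regardless of whether that address happens to be 16-byte aligned.
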
…tion

The explicit-name C API (elpa_eigenvectors_double, etc.) calls is_device_ptr()
at runtime to auto-detect host vs device pointers — unlike the generic C API
macro, which resolves to _a_h_a_* at compile time and never touches CUDA.

The ROCm backend (elpa_explicit_name_amd_gpu.hip) already handles
hipPointerGetAttributes failure gracefully: it prints a warning, clears the
error, and returns 0 (host pointer).  The CUDA backend instead called
exit(1), crashing the process when the driver is missing or too old.

This patch makes the CUDA path behave identically to the ROCm path: warn,
clear the sticky error with cudaGetLastError(), and return 0.
beta = sign(dlapy2(alpha, xnorm), alpha)
#else
beta = sign(slapy2(alpha, xnorm), alpha)
! NOTE: slapy2 is broken in Apple Accelerate vecLib for single precision.
Review comment by the author (SoShiny):

So, apparently Apple's Accelerate library ships both an old CLAPACK and a new LAPACK implementation, and if one binds the Fortran interfaces to the new LAPACK, the functions work. The new symbols carry the suffixes $NEWLAPACK and $NEWLAPACK$ILP64.

    function slapy2(x, y) bind(C, name="slapy2$NEWLAPACK") result(r)
      import :: c_ptr, c_float
      type(c_ptr), value :: x, y
      real(c_float) :: r
    end function

…a_h_a

qDev is declared but never allocated in this function (eigenvalues-only
path). The gpu_free was incorrectly copied from elpa_generalized_eigenvectors_a_h_a
where qDev is properly allocated. This caused cudaFree: invalid argument
on GPU.

Introduced in 017e3fc.