Add CMake build system overlay with cross-platform support and source fixes#71
Open
SoShiny wants to merge 42 commits into
Replace undefined _XOR_EPI macro call with _SIMD_XOR_EPI in the AVX-512 Xeon Phi path of the complex BLOCK kernel template. Only _SIMD_XOR_EPI is defined. The code compiles when the ifdef is inactive but would fail on Knights Landing processors.
Caller sees stale error code when the Cholesky decomposition fails because the function falls through without returning the failure status. Set a unique error code (141414) and return immediately.
last_stripe_width was computed inside if(useGPU) but used unconditionally on the CPU path. On Linux the stack happened to be zeroed, masking the bug. On Windows the uninitialized value caused wrong stripe decomposition and two-stage eigenvector corruption.
In tridiag_template.F90, my_stream was declared private in the OpenMP parallel directive, leaving each thread with an uninitialized stream handle. Changed to firstprivate so every thread copies the initialized value, preventing access violations in the CUDA driver.
elpa_setup() and elpa_setup_gpu() were called inside assert_elpa_ok() which expands to assert(). With NDEBUG the calls are silently removed, leaving the ELPA handle uninitialized and causing crashes. Move the calls outside the assert.
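A minimal sketch of the pattern described above, assuming a stand-in routine: `fake_setup`, `setup_calls`, and `setup_checked` are hypothetical names for illustration and are not part of the ELPA API. It shows why a call with side effects should run outside `assert()` and only its result should be asserted.

```c
#include <assert.h>

/* Hypothetical stand-in for a setup routine with a side effect;
   this is not the real ELPA API. */
int setup_calls = 0;

int fake_setup(void) {
    setup_calls++;   /* the side effect that must not be lost */
    return 0;        /* 0 means success in this sketch */
}

/* Fragile form: when NDEBUG is defined, assert(...) expands to nothing,
   so the call below would never be executed:
       assert(fake_setup() == 0);
*/

/* Robust form: run the call unconditionally, then check the result. */
int setup_checked(void) {
    int status = fake_setup();
    assert(status == 0);
    (void)status;    /* avoids an unused-variable warning under NDEBUG */
    return status;
}
```

With this shape the setup call runs in both debug and release builds; only the check itself disappears when `NDEBUG` is defined.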
sscanf "%lf" reads a double into a float* pointer, corrupting the stack. Changed to "%f". Also added missing return values in _enumerate switch functions and a portable _lfind wrapper for Windows.
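The conversion mismatch can be illustrated with a small sketch. The helper names `parse_float` and `parse_double` are assumptions made for this example and do not appear in the ELPA sources.

```c
#include <stdio.h>

/* Parse a single-precision value. "%f" matches a float* destination;
   "%lf" would make sscanf store a double through the pointer, writing
   past the end of the float object. */
int parse_float(const char *text, float *out) {
    return sscanf(text, "%f", out) == 1;
}

/* Parse a double-precision value; here "%lf" is the matching conversion. */
int parse_double(const char *text, double *out) {
    return sscanf(text, "%lf", out) == 1;
}
```

Both helpers return 1 on a successful conversion and 0 otherwise, so a caller can reject malformed input instead of reading an unset variable.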
Initialize my_stream to 0 before the WITH_GPU_STREAMS conditional in solve_tridi_col, solve_tridi_single_problem, and merge_systems templates. Prevents run-time uninitialized-variable errors when GPU streams are disabled.
Initialize return variables (flag=-1, success=.false., version=0) before preprocessor-guarded assignments in GPU vendor-agnostic CCL and BLAS layer templates. Prevents compiler warnings and undefined behavior when compiled without a GPU backend.
Initialize version=0 and success=.false. in GPU vendor-agnostic and cusolver templates before preprocessor-guarded assignments. Suppresses ifort #6178 on non-GPU builds.
Initialize the version result to 0 before the WITH_AMD_ROCSOLVER guard so the variable is always defined on non-ROCm builds.
The umc_dev pointer arithmetic in the GPU skewsymmetric path of bandred used (n_cols+1+1) instead of (n_cols+1-1) for the Fortran-to-C 0-based index conversion. This read 2 columns past the intended position, causing wrong eigenvalues for small matrices and CUDA memory errors for large matrices. The CPU path and GPU symmetric path were correct.
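The index conversion can be sketched as follows. The exact expression in bandred is not shown in the text, so this assumes a column-major array with a leading dimension and a fixed element size; `column_offset_bytes` is a hypothetical helper.

```c
#include <stddef.h>

/* Byte offset of a 1-based (Fortran-style) column in a column-major
   array. Converting to a 0-based index means subtracting 1, which is
   the (n_cols+1-1) form mentioned in the text. */
size_t column_offset_bytes(size_t col_1based, size_t ld, size_t elem_size) {
    return (col_1based - 1) * ld * elem_size;
}
```

For n_cols = 4, ld = 10 and 8-byte elements, column n_cols+1 starts at byte 320. Using (n_cols+1+1) as the 0-based index instead would give 480, which is two columns further on.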
The Fortran variant used an unbounded do while(autotune_step()) loop that exhausts the full ELPA_AUTOTUNE_FAST search space. With CUDA enabled the search space contains hundreds of parameter combinations, and each iteration additionally executed two call sleep(2) statements, making the test run for 600+ seconds. The C and C++ variants already cap the loop at 20 SCF steps. Apply the same limit to the Fortran variant and remove the sleep() calls.
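The capped loop can be sketched in C as below. `fake_autotune_step` is a hypothetical stand-in that only counts down a search space; it is not the ELPA autotune call.

```c
/* Hypothetical stand-in for an autotune step: returns nonzero while
   untested parameter combinations remain. */
int fake_autotune_step(int *remaining) {
    if (*remaining <= 0) return 0;
    (*remaining)--;
    return 1;
}

/* Run at most max_steps iterations instead of draining the whole
   search space. Returns the number of steps actually executed. */
int run_capped(int search_space, int max_steps) {
    int remaining = search_space;
    int steps = 0;
    while (steps < max_steps && fake_autotune_step(&remaining)) {
        steps++;
    }
    return steps;
}
```

The step counter is tested first, so once the cap is reached no further autotune step is requested.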
Add conditional includes for POSIX-only headers, complex-type wrappers, and Windows-specific guards around the aligned allocation interfaces. All changes are guarded with #ifdef _WIN32 and leave Linux behavior unchanged.
Add platform-aware complex type macros for the SIMD kernel template so clang-cl (which supports C99 _Complex unlike MSVC cl.exe) can compile the complex BLOCK kernels on Windows.
MSVC headers redefine 'complex' as a macro via corecrt_math.h. Use double_complex/float_complex platform macros in _Generic selections to avoid conflicts.
Add alloca.h fallback for C and CUDA tests, use C99 _Complex types directly for ccache+clang-cl compatibility.
Add alloca.h/malloc.h fallback and replace _Complex types with cuComplex types in extern C signatures. nvcc on Windows uses MSVC cl.exe as host compiler which does not support _Complex.
Initialize module-level integer variables to 0 so ifort emits BSS/DATA symbols instead of COMMON. COMMON symbols are not visible to the Windows linker for DLL export.
Use direct function-call assignment instead of module-variable intermediaries in set_gpu_parameters. Skip elpa_ccl_gpu module writes on Windows (NCCL is Linux-only) to avoid cross-DLL data symbol access that lld-link cannot resolve.
The C++ compilation path needs Complex_I defined as std::complex<EV_TYPE>(0.0,1.0) before use. Without it the test fails to compile as C++.
The test includes <unistd.h> and calls sleep() which are POSIX-only. On Windows, map sleep() to Win32 Sleep() via <windows.h>.
The _WIN32 branch defined Complex_I as std::complex<EV_TYPE>(0.0,1.0) which is C++ syntax invalid in C mode. MSVC UCRT's _Complex_I is an opaque struct that doesn't support arithmetic operators, so use clang's __builtin_complex(0.0, 1.0) to construct a double _Complex imaginary unit.
Add src/helpers/posix_memalign_compat.c with a Windows-compatible implementation of posix_memalign and a paired elpa_aligned_free helper. All code is guarded with #ifdef _WIN32; Linux builds are unaffected.
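One way such a shim could look is sketched below. The names `compat_posix_memalign` and `compat_aligned_free` are assumptions for this example; the signature of the PR's elpa_aligned_free helper is not shown in the text. On Windows the sketch relies on `_aligned_malloc` and `_aligned_free` from `<malloc.h>`.

```c
#if !defined(_WIN32) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200112L   /* request the posix_memalign declaration */
#endif
#include <stdlib.h>
#include <stdint.h>
#include <errno.h>
#ifdef _WIN32
#include <malloc.h>               /* _aligned_malloc, _aligned_free */
#endif

/* Allocate size bytes aligned to alignment; returns 0 on success,
   an errno-style code otherwise (same contract as posix_memalign). */
int compat_posix_memalign(void **ptr, size_t alignment, size_t size) {
#ifdef _WIN32
    /* Mirror the POSIX rule: power of two and a multiple of sizeof(void *). */
    if (alignment < sizeof(void *) || (alignment & (alignment - 1)) != 0)
        return EINVAL;
    void *p = _aligned_malloc(size, alignment);
    if (p == NULL) return ENOMEM;
    *ptr = p;
    return 0;
#else
    return posix_memalign(ptr, alignment, size);
#endif
}

/* Paired release: memory from _aligned_malloc must not be passed to free(). */
void compat_aligned_free(void *ptr) {
#ifdef _WIN32
    _aligned_free(ptr);
#else
    free(ptr);
#endif
}
```

Pairing the allocator with its own free helper keeps callers from mixing `free()` with `_aligned_malloc` memory on Windows, while the Linux branch reduces to the usual `posix_memalign`/`free` pair.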
clang 21 introduced -Wdefault-const-init-var-unsafe, which fires on const VLAs without initializers. The arrays are only used for their sizeof — dropping const preserves the compile-time size computation and silences the diagnostic.
flang-new (all upstream versions 17–23, including latest dev snapshot)
incorrectly dereferences type(c_ptr) arguments in bind(C) calls when
an integer(c_intptr_t) interface for the same C function is declared
first. This passes the pointer VALUE instead of its ADDRESS, causing
SIGSEGV in the NVIDIA driver for all eigensolver GPU tests (533 of
800 failures).
Route cuda_malloc{,_host}_cptr and hip_malloc{,_host}_cptr through
the intptr variant (integer(c_intptr_t), unaffected) and transfer
the result back to type(c_ptr).
gfortran, ifx, and AOCC flang are not affected — the workaround is
harmless for correct compilers.
Upstream: llvm/llvm-project#192655
GCC warns about functions marked __attribute__((always_inline)) that may not be inlinable when the inline keyword is missing. Add it to silence the warning and satisfy the compiler on ARM/aarch64.
OpenMP 5.1 deprecated !$omp master in favour of !$omp masked. Switch the source to the newer directive while keeping HAVE_OMP_MASKED guards so compilers without masked support can continue to use the old path. Detection of HAVE_OMP_MASKED is currently handled in the CMake build.
Use %zu instead of %d for size_t arguments in cusolver error/debug messages. Prevents -Wformat warnings and potentially truncated output on 64-bit platforms.
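A short illustration of the matching conversion, with a hypothetical helper name and message text (`format_size` is not taken from the cusolver sources):

```c
#include <stdio.h>
#include <stddef.h>

/* Format a size_t with "%zu" (C99). Passing a size_t to "%d" is
   undefined behavior and typically truncates on LP64 platforms. */
int format_size(char *buf, size_t buflen, size_t value) {
    return snprintf(buf, buflen, "workspace bytes: %zu", value);
}
```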
Add (void) cast for unused negative_or_positive parameter in
cuda_copy_skewsymmetric_first_half_q_{double,float}_FromC.
The parameter is API-symmetric with the _second_half variants
but unused because the minus-kernel always negates unconditionally.
Remove entries_in_sub_matrix, columns_in_sub_matrix, and number_of_entries which are computed but never read.
Add missing #undef before #define ROW_LENGTH in real_128bit_256bit_512bit_BLOCK_template.c. The macro was defined in the kernel-function-name section without a trailing #undef.
Add (char*) cast to string literal passed to elpa_timer.
Missing return statements in GPU vendor-agnostic fallback paths cause -Wreturn-type. printf format specifier uses %d for a size_t-derived index that is long on LP64.
These REAL(C_DATATYPE_KIND) externals shadow the double-precision intrinsic ddot when the template is instantiated for single precision, triggering -Wexternal-interface-mismatch on flang-new. The routines never call ddot — the declarations are unreferenced.
vfmsq_f32(a, b, c) computes a - b*c, but _SIMD_NFMA(a, b, c) must compute c - a*b (matching the non-FMA fallback: c - MUL(a,b)). The double-precision counterpart vfmsq_f64(c, b, a) was already correct; the single-precision variant had the first and third arguments swapped. This caused the NEON_ARCH64_BLOCK4 (and BLOCK6) kernels to produce completely wrong results when operating on single-precision matrices, giving orthogonality errors of order 1e2 instead of 1e-6. BLOCK2 was unaffected as its rank-2 update path does not call _SIMD_NFMA.
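The argument-order issue can be reproduced with a scalar model. The functions below are illustrative stand-ins: `fms_model` mimics one lane of vfmsq_f32 as described in the text, and the `nfma_*` names are hypothetical rather than the macros from the kernel template.

```c
/* Scalar model of one lane of vfmsq_f32(a, b, c): a - b*c. */
float fms_model(float a, float b, float c) { return a - b * c; }

/* Required semantics per the text: NFMA(a, b, c) = c - a*b. */
float nfma_reference(float a, float b, float c) { return c - a * b; }

/* Buggy mapping: arguments passed straight through, giving a - b*c. */
float nfma_buggy(float a, float b, float c) { return fms_model(a, b, c); }

/* Corrected mapping: the accumulator goes first, giving c - b*a. */
float nfma_fixed(float a, float b, float c) { return fms_model(c, b, a); }
```

With a = 2, b = 3, c = 10 the reference and the corrected mapping both give 4, while the straight-through mapping gives -28.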
…liaries
Apple Accelerate vecLib's single-precision LAPACK auxiliary routines
SLAMCH and SLAPY2 are non-functional on macOS Sequoia (confirmed 15.5,
Apple M4): SLAMCH returns 0.0 for every query, and SLAPY2 returns 0.0
or a large garbage constant for all inputs. The double-precision
counterparts DLAMCH and DLAPY2 are unaffected.
These two broken routines were the root cause of all 116 single-precision
test failures when ELPA is linked against Accelerate:
- SLAMCH('E') = 0.0 collapses the D&C deflation tolerance
TOL = 8*eps*max(dmax,zmax) to zero in merge_systems, causing wrong
deflation decisions and therefore incorrect eigenvalues.
- SLAPY2(c,s) = 0.0 (or garbage) produces a zero Givens rotation norm
in merge_systems, leading to division by zero / NaN in the deflation
step; the same bug corrupts Householder reflector norms in the
two-stage QR bandwidth-reduction (elpa_pdgeqrf_template).
Replace both calls with portable Fortran intrinsics — epsilon() and
hypot() — which are correct on all platforms and are what modern
Reference-LAPACK uses internally anyway. With these fixes all 424
tests pass at -O3 against Accelerate.
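The effect of a zero machine epsilon on the deflation tolerance can be shown with a small C sketch. It assumes `FLT_EPSILON` from `<float.h>` as the C analogue of Fortran's single-precision epsilon(); `deflation_tol` is a hypothetical helper that only evaluates the formula quoted above.

```c
#include <float.h>

/* Deflation tolerance as written in the text:
   TOL = 8 * eps * max(dmax, zmax). */
float deflation_tol(float eps, float dmax, float zmax) {
    float m = dmax > zmax ? dmax : zmax;
    return 8.0f * eps * m;
}
```

With eps = 0.0 the tolerance is exactly zero regardless of the matrix scale, whereas with `FLT_EPSILON` and max(dmax, zmax) = 2 it is 16 * `FLT_EPSILON`.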
Bare definitions of gpuMemcpyHostToDevice and gpuMemcpyDeviceToHost in layerVariables.h relied on ELF tentative-definition merging to silently deduplicate the symbols when multiple translation units are linked. GCC made -fno-common the default in GCC 10, which turns such collisions into link errors. COFF (Windows) has never supported tentative definitions and rejects them as hard duplicates. Move the authoritative definitions to layerFunctions.c unconditionally, and replace the bare definitions in both header files with extern declarations. Also move the extern int declarations in layerFunctions.h inside the already-present extern "C" block so that C++ translation units (e.g. test.cpp, which includes test.c) resolve to the C-linkage symbols in elpatest.lib rather than emitting C++-mangled references that can never be satisfied.
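The declaration/definition split can be sketched in a single file as follows. The variable names and the `int` type come from the text; the initial values and the helper `copy_kinds_are_distinct` are placeholders for this example, and in the real layout the two parts would live in a header and in one .c file respectively.

```c
/* Header part: what every translation unit would see. */
#ifdef __cplusplus
extern "C" {
#endif
extern int gpuMemcpyHostToDevice;   /* declaration only, reserves no storage */
extern int gpuMemcpyDeviceToHost;
#ifdef __cplusplus
}
#endif

/* Source part: exactly one .c file provides the definitions.
   The initial values are placeholders for this sketch. */
int gpuMemcpyHostToDevice = 1;
int gpuMemcpyDeviceToHost = 2;

/* Helper used only to show that each name resolves to one object. */
int copy_kinds_are_distinct(void) {
    return gpuMemcpyHostToDevice != gpuMemcpyDeviceToHost;
}
```

Because the header only declares the symbols, including it from many translation units no longer creates duplicate definitions under -fno-common or on COFF.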
Introduce a CMake-based build path without modifying the autotools files. Validated across x86_64 Linux, AArch64 Linux, macOS Apple Silicon, and Windows with multiple compiler, MPI, and math-library combinations — including CUDA where available. See cmake_build_examples/README.md for the full validation matrix, design decisions, and usage instructions.
NVTX v3 is header-only — the old libnvToolsExt.so no longer exists. The CMake detection searched only for the legacy library, which caused a warning when ELPA_NVTX=ON with a modern NVTX3 installation. Detection now prefers NVTX3 headers (nvtx3/nvToolsExt.h) and falls back to the legacy library. A small C shim (src/general/nvtx_impl.c) compiles with NVTX_EXPORT_API to emit the externally-visible nvtxRangePushA / nvtxRangePop symbols that Fortran bind(C) interfaces require at link time.
Fortran complex(kind=8) stack variables are only 8-byte aligned, but the compiler generates movapd (requires 16-byte alignment) when dereferencing the host pointer as double2. Use memcpy instead of direct dereference in gpu_set_e_vec_scale_set_one_store_v_row and gpu_store_u_v_in_uv_vu. Affects 1-stage solver with >1 MPI rank per GPU (useCCL=false path).
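The memcpy-based load can be sketched in plain C. `double2_like` is a hypothetical stand-in for CUDA's double2 (without its 16-byte alignment requirement), and `load_complex` is an illustrative helper rather than code from the PR.

```c
#include <string.h>

/* Stand-in for CUDA's double2: two doubles, real and imaginary part. */
typedef struct { double x, y; } double2_like;

/* Copy 16 bytes from an address that may only be 8-byte aligned,
   instead of dereferencing it as an aligned 16-byte vector type. */
double2_like load_complex(const void *src) {
    double2_like v;
    memcpy(&v, src, sizeof v);
    return v;
}
```

Compilers generally lower a fixed-size `memcpy` like this to unaligned loads, so the cost is small while the alignment assumption is removed.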
…tion
The explicit-name C API (elpa_eigenvectors_double, etc.) calls is_device_ptr() at runtime to auto-detect host vs device pointers — unlike the generic C API macro, which resolves to _a_h_a_* at compile time and never touches CUDA. The ROCm backend (elpa_explicit_name_amd_gpu.hip) already handles hipPointerGetAttributes failure gracefully: it prints a warning, clears the error, and returns 0 (host pointer). The CUDA backend instead called exit(1), crashing the process when the driver is missing or too old. This patch makes the CUDA path behave identically to the ROCm path: warn, clear the sticky error with cudaGetLastError(), and return 0.
SoShiny
commented
Apr 28, 2026
beta = sign(dlapy2(alpha, xnorm), alpha)
#else
beta = sign(slapy2(alpha, xnorm), alpha)
! NOTE: slapy2 is broken in Apple Accelerate vecLib for single precision.
So, apparently Apple's Accelerate library ships both an old CLAPACK and a new LAPACK implementation, and if one binds the Fortran interfaces to the new LAPACK symbols, the functions work. The new symbols have $NEWLAPACK and $NEWLAPACK$ILP64 suffixes.
function slapy2(x, y) bind(C, name="slapy2$NEWLAPACK") result(r)
import :: c_ptr, c_float
type(c_ptr), value :: x, y
real(c_float) :: r
end function
…a_h_a
qDev is declared but never allocated in this function (eigenvalues-only path). The gpu_free was incorrectly copied from elpa_generalized_eigenvectors_a_h_a where qDev is properly allocated. This caused cudaFree: invalid argument on GPU. Introduced in 017e3fc.
Motivation
This work originates from Synopsys QuantumATK's need to build ELPA on Windows,
aligning it with the existing Linux version. The CMake build system is added as a
supplemental overlay alongside the existing autotools setup; with some relatively
minor extra work it could serve as a replacement.
Advantages over autotools:
While developing the CMake overlay, several in-source issues were discovered and
fixed — ranging from hard bugs (crashes, wrong results) to build/runtime warning
cleanups. ARM (AArch64) and macOS support was explored as well, driven by the
recurring demand to run on Apple Silicon.
What's included
CMake build system (new, additive)
CMakeLists.txt + modular cmake/ modules covering compiler options, feature checks, kernel selection, Fortran preprocessing, generated
files, CUDA dependencies, MPI, OpenMP, math libraries, selective symbol
export, install, testing, and packaging.
generation, test generation, and constant processing.
Intel ICX/IFX — with MKL, OpenBLAS, BLIS; OpenMPI and Intel MPI; x86‑64 +
AArch64), Windows (Clang-CL + ifort/ifx), and macOS ARM64 (GCC +
Accelerate/OpenBLAS).
primarily for documentation and reproducibility rather than CI-grade automation.
Bug fixes (correctness)
my_stream must be firstprivate in OpenMP regions.
elpa_index sscanf stack corruption and missing returns.
templates.
_XOR_EPI in AVX-512 Xeon Phi complex kernel.
_SIMD_NFMA argument order.
gpuMemcpy* common-symbol ODR violations in test layer.
Warning and portability fixes
omp master with omp masked.
redefinitions, dead variables, unused parameters, and VLA const qualifiers.
inline keyword to __forceinline macro for standard compliance.
type(c_ptr) bind(C) codegen bug.
posix_memalign shim, Complex_I macro, portable complex type macros, _Generic selections, POSIX-only API guards, DLL symbol export initialization, and GPU/test/CUDA source adjustments.
test_multiple_objs autotuning loop at 20 SCF steps.
What's missing
The following autotools features are not yet implemented in the CMake overlay:
make doc).
make dist / tarball packaging.
configure flags (e.g. some niche kernel selections or experimental options may be missing).
Scope
134 files changed, ~10 300 insertions, ~300 deletions.
The ~9 000 lines of insertions are the CMake overlay and build examples;
the ~1 300 lines touch existing sources (fixes + portability guards).