socu is a high-performance library for solving block tridiagonal linear systems using GPU-accelerated Cholesky factorization. These systems arise naturally in numerous real-time applications across engineering and scientific computing, including model predictive control (MPC), trajectory optimization, Kalman filtering, and robotics applications where temporal dynamics create structured sparsity patterns. The library is designed with extensibility in mind to support more general problem structures in the future.
Through a multi-stage permutation strategy based on nested dissection, socu reduces the computational complexity from O(Nn³) for sequential factorization to O(log₂(N)n³) when sufficient parallel resources are available, where n is the block size and N is the number of blocks.
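To make the complexity comparison concrete, the sequential baseline is the classical block Cholesky recursion, sketched below in NumPy: each of the $N$ steps factors one $n \times n$ block and performs one Schur-complement update, giving $O(Nn^3)$ total work. This reference implementation is illustrative only; its function name and storage layout are not part of socu's API.

```python
import numpy as np

def block_tridiag_cholesky(D, E):
    """Sequential block Cholesky of a block tridiagonal SPD matrix.

    D: (N, n, n) symmetric diagonal blocks, E: (N-1, n, n) sub-diagonal blocks.
    Returns (L_diag, L_sub), the diagonal and sub-diagonal blocks of the
    lower triangular Cholesky factor. Each step factors one n x n block,
    hence O(N * n^3) total work, inherently serial in N.
    """
    N, n, _ = D.shape
    L_diag = np.zeros_like(D)
    L_sub = np.zeros_like(E)
    L_diag[0] = np.linalg.cholesky(D[0])
    for i in range(1, N):
        # Off-diagonal factor block: L_sub[i-1] @ L_diag[i-1].T = E[i-1]
        L_sub[i - 1] = np.linalg.solve(L_diag[i - 1], E[i - 1].T).T
        # Schur-complement update of the next diagonal block
        L_diag[i] = np.linalg.cholesky(D[i] - L_sub[i - 1] @ L_sub[i - 1].T)
    return L_diag, L_sub
```

socu's nested-dissection reordering breaks this serial dependency chain so that independent blocks can be factored concurrently.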
The library is implemented using NVIDIA's Warp library as the computational backend and provides a JAX interface.
- Python 3.9 or later
- NVIDIA GPU with CUDA support
```shell
pip install git+https://github.com/PREDICT-EPFL/socu.git
```

Or clone and install in editable mode:
```shell
git clone https://github.com/PREDICT-EPFL/socu.git
cd socu
pip install -e .
```

socu solves linear systems with block tridiagonal structure of the form:
$$\Psi x = b,$$

where

$$\Psi = \begin{bmatrix} D_1 & E_1^\top & & \\ E_1 & D_2 & \ddots & \\ & \ddots & \ddots & E_{N-1}^\top \\ & & E_{N-1} & D_N \end{bmatrix},$$

with:

- $D_i \in \mathbb{R}^{n \times n}$: symmetric diagonal blocks
- $E_i \in \mathbb{R}^{n \times n}$: off-diagonal blocks
- $N$: number of blocks
- $n$: block size
- Diagonal blocks `L`: shape $(N, n, n)$, symmetric positive definite matrices
- Off-diagonal blocks `E`: shape $(N-1, n, n)$ for JAX; computed size for Warp (via `calculate_off_diag_storage_len`)
- Right-hand side `b` / solution `x`: shape $(N, n, 1)$, or $(N, n, m)$ for multiple right-hand sides
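For small problems, arrays in this layout can be sanity-checked against a dense NumPy solve. The helper below is illustrative, not part of socu: it assembles the full symmetric matrix from the block storage and solves it directly.

```python
import numpy as np

def dense_reference_solve(L, E, b):
    """Solve Psi x = b by assembling the dense matrix (small problems only).

    L: (N, n, n) diagonal blocks; E: (>= N-1, n, n), only the first N-1
    entries are used (matching the larger Warp-side storage); b: (N, n, m).
    Returns x in the same (N, n, m) layout.
    """
    N, n, _ = L.shape
    m = b.shape[2]
    Psi = np.zeros((N * n, N * n))
    for i in range(N):
        Psi[i * n:(i + 1) * n, i * n:(i + 1) * n] = L[i]
        if i < N - 1:
            # E[i] sits below the diagonal, its transpose above
            Psi[(i + 1) * n:(i + 2) * n, i * n:(i + 1) * n] = E[i]
            Psi[i * n:(i + 1) * n, (i + 1) * n:(i + 2) * n] = E[i].T
    x = np.linalg.solve(Psi, b.reshape(N * n, m))
    return x.reshape(N, n, m)
```

This costs $O(N^3 n^3)$ and is only useful for validating inputs and outputs, not as a solver.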
```python
import numpy as np
import warp as wp

from socu.block_tridiag_solver import (
    create_cholesky_factor_launch,
    create_cholesky_solve_launch,
    create_cholesky_factor_and_solve_launch,
    calculate_off_diag_storage_len,
)

# Problem dimensions
n = 32   # block size
N = 100  # number of blocks

# Generate lower triangular Cholesky factors to ensure positive definiteness
D_chol = np.zeros((N, n, n))
E_chol = np.zeros((N - 1, n, n))
for i in range(N):
    D_chol[i] = np.tril(np.random.randn(n, n)) + 10 * np.eye(n)
    if i < N - 1:
        E_chol[i] = np.random.randn(n, n)

# Construct block tridiagonal matrix
L_np = np.zeros((N, n, n))
E_np = np.zeros((calculate_off_diag_storage_len(N), n, n))  # allocate correct size
for i in range(N):
    L_np[i] = D_chol[i] @ D_chol[i].T
    if i > 0:
        L_np[i] += E_chol[i - 1] @ E_chol[i - 1].T
    if i < N - 1:
        E_np[i] = E_chol[i] @ D_chol[i].T

# Generate random right-hand side
b_np = np.random.randn(N, n, 1)

# Convert to Warp arrays
device = 'cuda'
dtype = wp.float64
L = wp.from_numpy(L_np, dtype=dtype, device=device)
E = wp.from_numpy(E_np, dtype=dtype, device=device)
x = wp.from_numpy(b_np, dtype=dtype, device=device)

# Create launch functions
cholesky_factor_launch = create_cholesky_factor_launch(
    L, E, device=device, dtype=dtype
)
cholesky_solve_launch = create_cholesky_solve_launch(
    L, E, x, device=device, dtype=dtype
)

# Solve: factor, then solve
cholesky_factor_launch()  # compute Cholesky factorization
cholesky_solve_launch()   # solve for x

# Get solution
x_solution = x.numpy()

# Verify solution correctness: reconstruct the full matrix
Psi_full = np.zeros((N * n, N * n))
for i in range(N):
    Psi_full[i * n:(i + 1) * n, i * n:(i + 1) * n] = L_np[i]
    if i < N - 1:
        Psi_full[(i + 1) * n:(i + 2) * n, i * n:(i + 1) * n] = E_np[i]
        Psi_full[i * n:(i + 1) * n, (i + 1) * n:(i + 2) * n] = E_np[i].T

# Check residual
residual = np.linalg.norm(b_np.flatten() - Psi_full @ x_solution.flatten())
assert residual < 1e-8, f"Solution error too large: {residual}"
print(f"Solution verified! Residual: {residual:.2e}")
```

For better performance, you can use the combined operation that interleaves factorization and forward substitution:
```python
# Reset x to b (the previous calls factored L and E in place)
x = wp.from_numpy(b_np, dtype=dtype, device=device)
L = wp.from_numpy(L_np, dtype=dtype, device=device)
E = wp.from_numpy(E_np, dtype=dtype, device=device)

# Combined factor and solve (faster)
cholesky_factor_and_solve_launch = create_cholesky_factor_and_solve_launch(
    L, E, x, device=device, dtype=dtype
)
cholesky_factor_and_solve_launch()
x_solution = x.numpy()

# Verify solution
residual = np.linalg.norm(b_np.flatten() - Psi_full @ x_solution.flatten())
assert residual < 1e-8, f"Solution error too large: {residual}"
```

```python
import jax.numpy as jnp
from jax import config

from socu.jax import cholesky_factor, cholesky_solve, cholesky_factor_and_solve

config.update("jax_enable_x64", True)

# Create JAX arrays
L_jax = jnp.array(L_np, dtype=jnp.float64)
E_jax = jnp.array(E_np, dtype=jnp.float64)
b_jax = jnp.array(b_np, dtype=jnp.float64)

# Separate factor and solve
L_factor, E_factor = cholesky_factor(L_jax, E_jax)
x_solution = cholesky_solve(L_factor, E_factor, b_jax)

# Verify solution
Psi_full_jax = jnp.array(Psi_full)
residual = jnp.linalg.norm(b_jax.flatten() - Psi_full_jax @ x_solution.flatten())
assert residual < 1e-8, f"Solution error too large: {residual}"

# Or combined
L_factor, E_factor, x_solution = cholesky_factor_and_solve(L_jax, E_jax, b_jax)
```

For optimal performance:
- Block size alignment: use block sizes $n$ that are multiples of 8 (for float64) or 16 (for float32). The JAX interface pads inputs automatically by default (`pad_problem=True`), so manual alignment is not needed there; the Warp interface does not pad, so align manually.
- CUDA graphs: enable CUDA graphs for reduced kernel launch overhead:

  ```python
  cholesky_factor_launch = create_cholesky_factor_launch(
      L, E, device=device, dtype=dtype, use_cuda_graph=True
  )
  ```

- Precision selection: use `wp.float32`/`jnp.float32` for up to a 4x speedup when precision allows.
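As a quick illustration of the alignment rule, a padded block size is just the block size rounded up to the relevant multiple. The helper below mirrors what `pad_problem=True` does conceptually; the function itself is hypothetical and not part of socu.

```python
def pad_block_size(n: int, itemsize: int) -> int:
    """Round block size n up to a multiple of 8 for float64 (itemsize 8)
    or 16 for float32 (itemsize 4), per the alignment recommendation."""
    multiple = 8 if itemsize == 8 else 16
    return -(-n // multiple) * multiple  # ceiling division
```

For example, a float64 problem with $n = 30$ would be padded to 32, and a float32 problem with $n = 33$ to 48.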
If you found socu useful in your scientific work, we encourage you to cite our preprint:
```bibtex
@misc{schwan2026socu,
  author = {Roland Schwan and Daniel Kuhn and Colin N. Jones},
  title  = {{GPU}-Accelerated {Cholesky} Factorization of Block Tridiagonal Matrices},
  year   = {2026},
  eprint = {arXiv:2601.03754},
}
```
socu is released under the BSD 2-Clause License.