add high-level PETSc SNES interface for nonlinear problems #19

Open
jonas-heinzmann wants to merge 6 commits into bpachev:main from jonas-heinzmann:add-nonlinear-problem-snes-interface
Conversation

@jonas-heinzmann

Adds a NonlinearProblem class in python/cudolfinx/petsc.py, adapted from and closely resembling dolfinx.fem.petsc.NonlinearProblem, along with a demo in python/examples that makes use of this class.

Special attention is needed to make the SNES line searches work: they rely on internal work vectors for both the solution iterate and the residual assembled at that iterate. This necessitates copies between the assembled vectors and the PETSc work vectors.
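The copy choreography can be sketched with NumPy arrays standing in for the PETSc work vectors. This is a minimal illustration of the pattern only, not the actual cudolfinx code; `assemble_on_device` is a hypothetical placeholder for the device-side assembly call.

```python
import numpy as np

def assemble_residual(x, u, b, assemble_on_device):
    """Residual callback pattern: SNES owns x and b; u is the solution Function's array."""
    # SNES hands us its internal line-search iterate x; copy it into the
    # solution array so the (device-side) assembly sees the current iterate.
    u[:] = x
    # assemble the residual at the iterate
    r = assemble_on_device(u)
    # copy the assembled residual back into the SNES work vector b
    b[:] = r

# toy example: residual of f(u) = u^2 - 2, evaluated at the iterate x = 1.5
u = np.zeros(1)
b = np.zeros(1)
assemble_residual(np.array([1.5]), u, b, lambda v: v**2 - 2.0)
```

The key point is the ordering: the iterate must land in the solution array before assembly, and the assembled result must land back in the SNES-owned vector before the line search inspects it.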

Output for said example with `python nonlinear.py --res 50`:

```
750000 elements, 132651 DOFs

  0 SNES Function norm 2.100122195486e-03
    Linear nl_ solve converged due to CONVERGED_RTOL iterations 22
          0 Line search: fty/||y|| = 0.000788004, lambda = 0.5
          1 Line search: fty/||y|| = 0.000393428, lambda = 0.75
          2 Line search: fty/||y|| = 0.00019598, lambda = 0.875
          3 Line search: fty/||y|| = 9.72083e-05, lambda = 0.9375
          4 Line search: fty/||y|| = 4.78093e-05, lambda = 0.96875
          5 Line search: fty/||y|| = 2.31065e-05, lambda = 0.984375
          6 Line search: fty/||y|| = 1.07542e-05, lambda = 0.992188
          7 Line search: fty/||y|| = 4.57785e-06, lambda = 0.996094
          8 Line search: fty/||y|| = 1.48961e-06, lambda = 0.998047
          9 Line search: fty/||y|| = -5.45186e-08, lambda = 0.999023
        Line search: abs(fty)/||y|| = 5.45186e-08 <= atol = 1e-06
  1 SNES Function norm 5.006470670468e-06
    Linear nl_ solve converged due to CONVERGED_RTOL iterations 22
        Line search: sign of fty does not change in step interval, accepting full step
  2 SNES Function norm 3.036569126758e-11
  Nonlinear nl_ solve converged due to CONVERGED_FNORM_ABS iterations 2

elapsed time: 1.336 s (GPU)
```

and for `python nonlinear.py --res 50 --no-cuda`:

```
750000 elements, 132651 DOFs

  0 SNES Function norm 2.100122195486e-03
    Linear nl_ solve converged due to CONVERGED_RTOL iterations 22
          0 Line search: fty/||y|| = 0.000788004, lambda = 0.5
          1 Line search: fty/||y|| = 0.000393428, lambda = 0.75
          2 Line search: fty/||y|| = 0.00019598, lambda = 0.875
          3 Line search: fty/||y|| = 9.72083e-05, lambda = 0.9375
          4 Line search: fty/||y|| = 4.78093e-05, lambda = 0.96875
          5 Line search: fty/||y|| = 2.31065e-05, lambda = 0.984375
          6 Line search: fty/||y|| = 1.07542e-05, lambda = 0.992188
          7 Line search: fty/||y|| = 4.57785e-06, lambda = 0.996094
          8 Line search: fty/||y|| = 1.48961e-06, lambda = 0.998047
          9 Line search: fty/||y|| = -5.45186e-08, lambda = 0.999023
        Line search: abs(fty)/||y|| = 5.45186e-08 <= atol = 1e-06
  1 SNES Function norm 5.006470670469e-06
    Linear nl_ solve converged due to CONVERGED_RTOL iterations 22
        Line search: sign of fty does not change in step interval, accepting full step
  2 SNES Function norm 3.036569125318e-11
  Nonlinear nl_ solve converged due to CONVERGED_FNORM_ABS iterations 2

elapsed time: 8.248 s (CPU)
```

…ES (resembling dolfinx.fem.petsc.NonlinearProblem), and add demo solving a nonlinear problem with said interface class
@jonas-heinzmann
Author

@bpachev One common use case for nonlinear problems is to apply the BCs incrementally, i.e. to have some sort of load stepping and update the values of the BCs with the load step (or one might have time-dependent BCs). It seems to me that this does not work if one uses the packed CUDADirichletBC returned from cudolfinx.assemble.CUDAAssembler.pack_bcs(): updating the underlying dolfinx.fem.bcs.DirichletBC has no effect on the CUDADirichletBC.

@bpachev
Owner

bpachev commented Mar 10, 2026

> @bpachev One common use case for non-linear problems is to apply the BCs in an incremental way, i.e. to have some sort of load stepping and update the values of the BCs with the load step (or one might have time-dependent BCs). It seems to me that this does not work if one uses the packed CUDADirichletBC returned from cudolfinx.assemble.CUDAAssembler.pack_bcs(), i.e. updating the underlying dolfinx.fem.bcs.DirichletBC has no effect on the CUDADirichletBC.

@jonas-heinzmann The CUDADirichletBC object has an update method for precisely this reason.

@jonas-heinzmann
Author

> @jonas-heinzmann The CUDADirichletBC object has an update method for precisely this reason.

I see, sorry for missing that! What do you think of calling it in the solve method such that the user doesn't need to take care of this? Then one could exchange the dolfinx-version of the NonlinearProblem class with the cudolfinx-version 1:1 without having to worry about extra function calls.

@bpachev
Owner

bpachev commented Mar 12, 2026

> I see, sorry for missing that! What do you think of calling it in the solve method such that the user doesn't need to take care of this? Then one could exchange the dolfinx-version of the NonlinearProblem class with the cudolfinx-version 1:1 without having to worry about extra function calls.

I think that is a great idea. I have very little documentation currently, so it is easy to miss. The method does require the list of boundary conditions that need to be updated, but it should be fine to update all of them just in case.
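The behavior agreed on here can be sketched with minimal stand-in classes. These are hypothetical mocks for illustration only, not the real cudolfinx objects; only the `update()` method name comes from the discussion above.

```python
class MockDirichletBC:
    """Stand-in for dolfinx.fem.bcs.DirichletBC: the host-side BC the user edits."""
    def __init__(self, value):
        self.value = value

class MockPackedBC:
    """Stand-in for CUDADirichletBC: caches a device-side copy of the value."""
    def __init__(self, bc):
        self._bc = bc
        self.device_value = bc.value

    def update(self):
        # re-pack the (possibly changed) host-side value to the device cache
        self.device_value = self._bc.value

class MockNonlinearProblem:
    """Stand-in for the proposed solve() behavior: refresh all packed BCs."""
    def __init__(self, packed_bcs):
        self._packed_bcs = packed_bcs

    def solve(self):
        # update every packed BC just in case its host value changed,
        # so load stepping works without extra user calls
        for bc in self._packed_bcs:
            bc.update()

# load stepping: the user only ever touches the host-side BC value
bc = MockDirichletBC(0.0)
packed = MockPackedBC(bc)
problem = MockNonlinearProblem([packed])
for step in range(3):
    bc.value = 0.1 * (step + 1)
    problem.solve()
```

With the update folded into solve(), the loop above is a drop-in match for the dolfinx version of the class, as proposed.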

@jonas-heinzmann
Author

Previously, the CUDAVector binding had a hardcoded include_ghosts=False, which could lead to errors in the _assemble_residual method because of the necessary copies between u and x (the apply_lifting kernel would access x0 at ghost DOF indices, causing an illegal memory access). Accordingly, I added an include_ghosts keyword argument to the wrapper (default False, preserving the previous behavior) and exposed the to_device() method to be called after the ghost update.
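The underlying indexing problem can be illustrated with a toy example, using made-up sizes and plain NumPy arrays in place of device vectors: a process-local array holds owned DOFs followed by ghost DOFs, and a gather at a ghost index is out of bounds when the array was allocated without the ghost region (on the GPU this surfaces as an illegal memory access).

```python
import numpy as np

# hypothetical local layout: owned DOFs first, then ghost DOFs
n_owned, n_ghost = 4, 2
values = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0]  # owned values, then ghost values

def local_array(vals, include_ghosts):
    """Allocate the local array with or without the ghost region."""
    size = n_owned + n_ghost if include_ghosts else n_owned
    return np.array(vals[:size])

x0_no_ghosts = local_array(values, include_ghosts=False)
x0_with_ghosts = local_array(values, include_ghosts=True)

ghost_index = n_owned  # index of the first ghost DOF

# without ghosts, a kernel-style gather at a ghost index is out of bounds
try:
    x0_no_ghosts[ghost_index]
    ok_without_ghosts = True
except IndexError:
    ok_without_ghosts = False

# with ghosts included, the same gather succeeds
ok_with_ghosts = x0_with_ghosts[ghost_index] == 10.0
```

NumPy raises a catchable IndexError here; a CUDA kernel performing the same out-of-bounds read has no such safety net, hence the crash the include_ghosts flag avoids.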

@jonas-heinzmann
Author

@bpachev I believe this should be ready now, could you take another look? Thanks a lot!

Owner

@bpachev left a comment


Good catch on the ghosting for x0 - this is another reason to rework the API to directly use PETSc Vectors. Would you be able to address the comments about copying b and using ghost_layer_mesh in the demo? Also, have you tested to ensure that the results on the GPU are the same as those on the CPU, particularly in parallel?

```python
)

# copy assembled vector to PETSc work vector
self._cuda_b.vector.copy(b)
```
Owner


If you create a CUDAVector for b like what you have for x and use that in the assembly routines, you can avoid this copy.

Owner


I used nsys to profile the example you shared, with and without CUDAVector for b.
Original:

```
** CUDA GPU MemOps Summary (by Size) (cuda_gpu_mem_size_sum):

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)            Operation
 ----------  -----  --------  --------  --------  --------  -----------  ------------------------------
    541.068    653     0.829     0.003     0.000    65.059        4.935  [CUDA memcpy Host-to-Device]
    192.048    453     0.424     0.042     0.002     3.903        0.531  [CUDA memcpy Device-to-Device]
    165.536    427     0.388     0.003     0.000     7.712        1.110  [CUDA memset]
    108.620    391     0.278     0.000     0.000    32.529        2.019  [CUDA memcpy Device-to-Host]
```

Using CUDAVector instead of copying:

```
** CUDA GPU MemOps Summary (by Size) (cuda_gpu_mem_size_sum):

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)            Operation
 ----------  -----  --------  --------  --------  --------  -----------  ------------------------------
    541.068    653     0.829     0.003     0.000    65.059        4.935  [CUDA memcpy Host-to-Device]
    179.314    441     0.407     0.042     0.002     3.903        0.528  [CUDA memcpy Device-to-Device]
    165.536    427     0.388     0.003     0.000     7.712        1.110  [CUDA memset]
    108.620    391     0.278     0.000     0.000    32.529        2.019  [CUDA memcpy Device-to-Host]
```

The impact on runtime is pretty negligible since runtime is dominated by the linear solver. But it does avoid the copy without increasing memory movement elsewhere.

Author


Good point! I now wrapped the b provided by PETSc into a CUDAVector inside _assemble_residual.

```python
def main(res: int = 30, degree: int = 1, dim: int = 3, cuda: bool = True):
    """solve a nonlinear problem on a CPU or GPU"""

    domain = create_mesh(res, dim=dim)
```
Owner


In order for the demo to work in parallel in the CUDA case, you need to do `domain = cudolfinx.ghost_layer_mesh(domain)` when running with more than one process, as in the Poisson example. Have you tested the correctness of your interface with multiple MPI processes?

Author


I had this just a few lines below, so the demo also works in parallel. But good point, I added a test in python/test/test_nonlinear_problem.py that compares CPU and GPU results (and also passes in parallel).
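The shape of such a CPU-vs-GPU check can be sketched as follows. This is a hedged illustration with a stand-in solver; `solve_problem` is a hypothetical entry point, not the actual function in the test file.

```python
import numpy as np

def compare_cpu_gpu(solve_problem, rtol=1e-8, atol=1e-10):
    """Solve the same problem on both backends and compare the solution vectors.

    The tolerances absorb floating-point non-determinism between the
    CPU and GPU code paths (different reduction orders, etc.).
    """
    u_cpu = solve_problem(cuda=False)
    u_gpu = solve_problem(cuda=True)
    return np.allclose(u_cpu, u_gpu, rtol=rtol, atol=atol)

# stand-in solver: both paths return the same vector up to tiny noise,
# mimicking a correctly matching CPU/GPU implementation
def fake_solve(cuda):
    u = np.linspace(0.0, 1.0, 5)
    return u + (1e-12 if cuda else 0.0)

result = compare_cpu_gpu(fake_solve)
```

Under MPI, the same comparison would be done per rank on the locally owned part of the solution vector.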

… hence to avoid a copy in _assemble_residual function
@jonas-heinzmann requested a review from bpachev on March 17, 2026 21:48
@jonas-heinzmann force-pushed the add-nonlinear-problem-snes-interface branch from ae754d5 to a887842 on March 17, 2026 21:50
@jonas-heinzmann force-pushed the add-nonlinear-problem-snes-interface branch from a887842 to 5d59b61 on March 17, 2026 21:52