add high-level PETSc SNES interface for nonlinear problems #19

Open
jonas-heinzmann wants to merge 6 commits into bpachev:main from jonas-heinzmann:add-nonlinear-problem-snes-interface
Conversation

@jonas-heinzmann

Adds a NonlinearProblem class in python/cudolfinx/petsc.py, adapted from and closely resembling dolfinx.fem.petsc.NonlinearProblem, along with a demo in python/examples that makes use of this class.

Special attention is needed to make the SNES line searches work: they rely on internal work vectors for both the solution iterate and the residual assembled at that iterate. This necessitates copies between the assembled vectors and the PETSc work vectors.
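The copy choreography can be sketched with NumPy arrays standing in for the PETSc work vectors. This is a minimal illustration of the pattern only, not the actual cudolfinx code; `assemble_on_device` is a hypothetical placeholder for the device-side assembly call.

```python
import numpy as np

def assemble_residual(x, u, b, assemble_on_device):
    """Residual callback pattern: SNES owns x and b; u is the solution Function's array."""
    # SNES hands us its internal line-search iterate x; copy it into the
    # solution array so the (device-side) assembly sees the current iterate.
    u[:] = x
    # assemble the residual at the iterate
    r = assemble_on_device(u)
    # copy the assembled residual back into the SNES work vector b
    b[:] = r

# toy example: residual of f(u) = u^2 - 2, evaluated at the iterate x = 1.5
u = np.zeros(1)
b = np.zeros(1)
assemble_residual(np.array([1.5]), u, b, lambda v: v**2 - 2.0)
```

The key point is the ordering: the iterate must land in the solution array before assembly, and the assembled result must land back in the SNES-owned vector before the line search inspects it.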

Output for said example with `python nonlinear.py --res 50`:

```
750000 elements, 132651 DOFs

  0 SNES Function norm 2.100122195486e-03
    Linear nl_ solve converged due to CONVERGED_RTOL iterations 22
          0 Line search: fty/||y|| = 0.000788004, lambda = 0.5
          1 Line search: fty/||y|| = 0.000393428, lambda = 0.75
          2 Line search: fty/||y|| = 0.00019598, lambda = 0.875
          3 Line search: fty/||y|| = 9.72083e-05, lambda = 0.9375
          4 Line search: fty/||y|| = 4.78093e-05, lambda = 0.96875
          5 Line search: fty/||y|| = 2.31065e-05, lambda = 0.984375
          6 Line search: fty/||y|| = 1.07542e-05, lambda = 0.992188
          7 Line search: fty/||y|| = 4.57785e-06, lambda = 0.996094
          8 Line search: fty/||y|| = 1.48961e-06, lambda = 0.998047
          9 Line search: fty/||y|| = -5.45186e-08, lambda = 0.999023
        Line search: abs(fty)/||y|| = 5.45186e-08 <= atol = 1e-06
  1 SNES Function norm 5.006470670468e-06
    Linear nl_ solve converged due to CONVERGED_RTOL iterations 22
        Line search: sign of fty does not change in step interval, accepting full step
  2 SNES Function norm 3.036569126758e-11
  Nonlinear nl_ solve converged due to CONVERGED_FNORM_ABS iterations 2

elapsed time: 1.336 s (GPU)
```

and for `python nonlinear.py --res 50 --no-cuda`:

```
750000 elements, 132651 DOFs

  0 SNES Function norm 2.100122195486e-03
    Linear nl_ solve converged due to CONVERGED_RTOL iterations 22
          0 Line search: fty/||y|| = 0.000788004, lambda = 0.5
          1 Line search: fty/||y|| = 0.000393428, lambda = 0.75
          2 Line search: fty/||y|| = 0.00019598, lambda = 0.875
          3 Line search: fty/||y|| = 9.72083e-05, lambda = 0.9375
          4 Line search: fty/||y|| = 4.78093e-05, lambda = 0.96875
          5 Line search: fty/||y|| = 2.31065e-05, lambda = 0.984375
          6 Line search: fty/||y|| = 1.07542e-05, lambda = 0.992188
          7 Line search: fty/||y|| = 4.57785e-06, lambda = 0.996094
          8 Line search: fty/||y|| = 1.48961e-06, lambda = 0.998047
          9 Line search: fty/||y|| = -5.45186e-08, lambda = 0.999023
        Line search: abs(fty)/||y|| = 5.45186e-08 <= atol = 1e-06
  1 SNES Function norm 5.006470670469e-06
    Linear nl_ solve converged due to CONVERGED_RTOL iterations 22
        Line search: sign of fty does not change in step interval, accepting full step
  2 SNES Function norm 3.036569125318e-11
  Nonlinear nl_ solve converged due to CONVERGED_FNORM_ABS iterations 2

elapsed time: 8.248 s (CPU)
```

…ES (resembling dolfinx.fem.petsc.NonlinearProblem), and add demo solving a nonlinear problem with said interface class
@jonas-heinzmann
Author

@bpachev One common use case for nonlinear problems is to apply the BCs incrementally, i.e. to have some sort of load stepping and update the values of the BCs with the load step (or one might have time-dependent BCs). It seems to me that this does not work if one uses the packed CUDADirichletBC returned from cudolfinx.assemble.CUDAAssembler.pack_bcs(): updating the underlying dolfinx.fem.bcs.DirichletBC has no effect on the CUDADirichletBC.

@bpachev
Owner

bpachev commented Mar 10, 2026

> @bpachev One common use case for non-linear problems is to apply the BCs in an incremental way, i.e. to have some sort of load stepping and update the values of the BCs with the load step (or one might have time-dependent BCs). It seems to me that this does not work if one uses the packed CUDADirichletBC returned from cudolfinx.assemble.CUDAAssembler.pack_bcs(), i.e. updating the underlying dolfinx.fem.bcs.DirichletBC has no effect on the CUDADirichletBC.

@jonas-heinzmann The CUDADirichletBC object has an update method for precisely this reason.

@jonas-heinzmann
Author

> @jonas-heinzmann The CUDADirichletBC object has an update method for precisely this reason.

I see, sorry for missing that! What do you think of calling it in the solve method such that the user doesn't need to take care of this? Then one could exchange the dolfinx-version of the NonlinearProblem class with the cudolfinx-version 1:1 without having to worry about extra function calls.

@bpachev
Owner

bpachev commented Mar 12, 2026

> I see, sorry for missing that! What do you think of calling it in the solve method such that the user doesn't need to take care of this? Then one could exchange the dolfinx-version of the NonlinearProblem class with the cudolfinx-version 1:1 without having to worry about extra function calls.

I think that is a great idea. I have very little documentation currently, so it is easy to miss. The method does require the list of boundary conditions that need to be updated, but it should be fine to update all of them just in case.
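The behavior agreed on here can be sketched with minimal stand-in classes. These are hypothetical mocks for illustration only, not the real cudolfinx objects; only the `update()` method name comes from the discussion above.

```python
class MockDirichletBC:
    """Stand-in for dolfinx.fem.bcs.DirichletBC: the host-side BC the user edits."""
    def __init__(self, value):
        self.value = value

class MockPackedBC:
    """Stand-in for CUDADirichletBC: caches a device-side copy of the value."""
    def __init__(self, bc):
        self._bc = bc
        self.device_value = bc.value

    def update(self):
        # re-pack the (possibly changed) host-side value to the device cache
        self.device_value = self._bc.value

class MockNonlinearProblem:
    """Stand-in for the proposed solve() behavior: refresh all packed BCs."""
    def __init__(self, packed_bcs):
        self._packed_bcs = packed_bcs

    def solve(self):
        # update every packed BC just in case its host value changed,
        # so load stepping works without extra user calls
        for bc in self._packed_bcs:
            bc.update()

# load stepping: the user only ever touches the host-side BC value
bc = MockDirichletBC(0.0)
packed = MockPackedBC(bc)
problem = MockNonlinearProblem([packed])
for step in range(3):
    bc.value = 0.1 * (step + 1)
    problem.solve()
```

With the update folded into solve(), the loop above is a drop-in match for the dolfinx version of the class, as proposed.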

@jonas-heinzmann
Author

Previously, the CUDAVector binding had a hardcoded include_ghosts=False, which could lead to errors in the _assemble_residual method because of the necessary copies between u and x (the apply_lifting kernel would access x0 at ghost DOF indices, causing an illegal memory access). Accordingly, I added an include_ghosts keyword argument to the wrapper (default False, preserving the previous behavior) and exposed the to_device() method to be called after the ghost update.
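The underlying indexing problem can be illustrated with a toy example, using made-up sizes and plain NumPy arrays in place of device vectors: a process-local array holds owned DOFs followed by ghost DOFs, and a gather at a ghost index is out of bounds when the array was allocated without the ghost region (on the GPU this surfaces as an illegal memory access).

```python
import numpy as np

# hypothetical local layout: owned DOFs first, then ghost DOFs
n_owned, n_ghost = 4, 2
values = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0]  # owned values, then ghost values

def local_array(vals, include_ghosts):
    """Allocate the local array with or without the ghost region."""
    size = n_owned + n_ghost if include_ghosts else n_owned
    return np.array(vals[:size])

x0_no_ghosts = local_array(values, include_ghosts=False)
x0_with_ghosts = local_array(values, include_ghosts=True)

ghost_index = n_owned  # index of the first ghost DOF

# without ghosts, a kernel-style gather at a ghost index is out of bounds
try:
    x0_no_ghosts[ghost_index]
    ok_without_ghosts = True
except IndexError:
    ok_without_ghosts = False

# with ghosts included, the same gather succeeds
ok_with_ghosts = x0_with_ghosts[ghost_index] == 10.0
```

NumPy raises a catchable IndexError here; a CUDA kernel performing the same out-of-bounds read has no such safety net, hence the crash the include_ghosts flag avoids.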

@jonas-heinzmann
Author

@bpachev I believe this should be ready now, could you take another look? Thanks a lot!

Owner

@bpachev left a comment


Good catch on the ghosting for x0 - this is another reason to rework the API to directly use PETSc Vectors. Would you be able to address the comments about copying b and using ghost_layer_mesh in the demo? Also, have you tested to ensure that the results on the GPU are the same as those on the CPU, particularly in parallel?

```python
)

# copy assembled vector to PETSc work vector
self._cuda_b.vector.copy(b)
```
Owner


If you create a CUDAVector for b like what you have for x and use that in the assembly routines, you can avoid this copy.

Owner


I used nsys to profile the example you shared, with and without CUDAVector for b.
Original:

```
** CUDA GPU MemOps Summary (by Size) (cuda_gpu_mem_size_sum):

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)            Operation
 ----------  -----  --------  --------  --------  --------  -----------  ------------------------------
    541.068    653     0.829     0.003     0.000    65.059        4.935  [CUDA memcpy Host-to-Device]
    192.048    453     0.424     0.042     0.002     3.903        0.531  [CUDA memcpy Device-to-Device]
    165.536    427     0.388     0.003     0.000     7.712        1.110  [CUDA memset]
    108.620    391     0.278     0.000     0.000    32.529        2.019  [CUDA memcpy Device-to-Host]
```

Using CUDAVector instead of copying:

```
** CUDA GPU MemOps Summary (by Size) (cuda_gpu_mem_size_sum):

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)            Operation
 ----------  -----  --------  --------  --------  --------  -----------  ------------------------------
    541.068    653     0.829     0.003     0.000    65.059        4.935  [CUDA memcpy Host-to-Device]
    179.314    441     0.407     0.042     0.002     3.903        0.528  [CUDA memcpy Device-to-Device]
    165.536    427     0.388     0.003     0.000     7.712        1.110  [CUDA memset]
    108.620    391     0.278     0.000     0.000    32.529        2.019  [CUDA memcpy Device-to-Host]
```

The impact on runtime is pretty negligible since runtime is dominated by the linear solver. But it does avoid the copy without increasing memory movement elsewhere.

Author


Good point! I now wrapped the b provided by PETSc into a CUDAVector inside _assemble_residual.

```python
def main(res: int = 30, degree: int = 1, dim: int = 3, cuda: bool = True):
    """solve a nonlinear problem on a CPU or GPU"""

    domain = create_mesh(res, dim=dim)
```
Owner


In order for the demo to work in parallel in the CUDA case, you need to do `domain = cudolfinx.ghost_layer_mesh(domain)` when running with more than one process, as in the Poisson example. Have you tested the correctness of your interface with multiple MPI processes?

Author


I had this just a few lines below, so the demo also works in parallel. But good point, I added a test in python/test/test_nonlinear_problem.py that compares CPU and GPU results (and also passes in parallel).
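The shape of such a CPU-vs-GPU check can be sketched as follows. This is a hedged illustration with a stand-in solver; `solve_problem` is a hypothetical entry point, not the actual function in the test file.

```python
import numpy as np

def compare_cpu_gpu(solve_problem, rtol=1e-8, atol=1e-10):
    """Solve the same problem on both backends and compare the solution vectors.

    The tolerances absorb floating-point non-determinism between the
    CPU and GPU code paths (different reduction orders, etc.).
    """
    u_cpu = solve_problem(cuda=False)
    u_gpu = solve_problem(cuda=True)
    return np.allclose(u_cpu, u_gpu, rtol=rtol, atol=atol)

# stand-in solver: both paths return the same vector up to tiny noise,
# mimicking a correctly matching CPU/GPU implementation
def fake_solve(cuda):
    u = np.linspace(0.0, 1.0, 5)
    return u + (1e-12 if cuda else 0.0)

result = compare_cpu_gpu(fake_solve)
```

Under MPI, the same comparison would be done per rank on the locally owned part of the solution vector.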

… hence to avoid a copy in _assemble_residual function
@jonas-heinzmann requested a review from bpachev on March 17, 2026 21:48
@jonas-heinzmann force-pushed the add-nonlinear-problem-snes-interface branch from ae754d5 to a887842 on March 17, 2026 21:50
@jonas-heinzmann force-pushed the add-nonlinear-problem-snes-interface branch from a887842 to 5d59b61 on March 17, 2026 21:52