Skip to content

fix: add shutdown methods to executors#925

Open
jbusecke wants to merge 14 commits intomainfrom
executor-cleaning
Open

fix: add shutdown methods to executors#925
jbusecke wants to merge 14 commits intomainfrom
executor-cleaning

Conversation

@jbusecke
Copy link
Collaborator

@jbusecke jbusecke commented Mar 12, 2026

@TomNicholas and I have been mulling over a complex native zarr ingestion job for a few days now. We were ingesting many batches of large (~1TB) native zarr stores, and saw a steady increase of memory which indicated that 'something' was holding onto memory in between batches. This PR adds tests to catch this behavior and a fix for the lithops executor that did fix our problem for now.

  • The dataset we are using is currently not public. I would like to demonstrate the base issue fully reproducibly. If anyone knows a ~1TB/300k chunks native zarr store in an anon bucket, please let me know.

  • Closes Lithops FunctionExecutor memory leaks: atexit handler + unbounded futures list #926

  • Tests added

  • Tests passing

  • No test coverage regression

  • Full type hint coverage

  • Changes are documented in docs/releases.md

  • New functions/methods are listed in an appropriate *.md file under docs/api

  • New functionality has documentation

jbusecke and others added 2 commits March 12, 2026 12:26
Add explicit shutdown() to SerialExecutor and DaskDelayedExecutor that
clears tracked futures. Enhance LithopsEagerFunctionExecutor.shutdown()
to clear cached _call_output on ResponseFutures before closing, preventing
memory accumulation across repeated map() calls. Add parametrized tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codecov
Copy link

codecov bot commented Mar 12, 2026

Codecov Report

❌ Patch coverage is 40.00000% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.62%. Comparing base (bdc1a3a) to head (a612af5).

Files with missing lines Patch % Lines
virtualizarr/parallel.py 40.00% 6 Missing ⚠️

❌ Your patch check has failed because the patch coverage (40.00%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #925      +/-   ##
==========================================
- Coverage   89.23%   88.62%   -0.61%     
==========================================
  Files          33       33              
  Lines        2025     2031       +6     
==========================================
- Hits         1807     1800       -7     
- Misses        218      231      +13     
Files with missing lines Coverage Δ
virtualizarr/parallel.py 64.10% <40.00%> (-25.56%) ⬇️

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.



@pytest.mark.parametrize("executor_cls", ALL_EXECUTORS)
class TestExecutorMemory:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if either of these tests will be reliable enough - curious of @chuckwondo 's thoughts.

@jbusecke jbusecke marked this pull request as ready for review March 12, 2026 19:50
Comment on lines +391 to +394
# Lithops registers self.clean as an atexit handler (executors.py __init__),
# which prevents the FunctionExecutor from ever being garbage collected.
# Unregister it so the executor can be freed after shutdown.
atexit.unregister(self.lithops_client.clean)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is absolutely wild and deserves raising upstream

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably so.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see #926



@pytest.mark.parametrize("executor_cls", ALL_EXECUTORS)
class TestExecutorMemory:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure these really belong here. They seem like tests that should occur upstream. If we see memory leaks in the upstream executors, we should probably be opening bugs against the appropriate repositories, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, see #926 for discussion of what we should do or not do to clean up

Copy link
Collaborator Author

@jbusecke jbusecke Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are right in principal, but I would propose to keep this around as at least an optional test due to the significant work that was needed to get to the bottom of this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah @chuckwondo I see these tests as hopefully-temporary, but unfortunately important.

@jbusecke
Copy link
Collaborator Author

Ok Tom and I actually worked on an alternative approach where we change the lithops config to set lithops.data_cleaner to false (this is true by default and triggers the atexit registration). Combined with the added .shutdown() method on the Lithops exec this solves the problem in #926 and seems a bunch nicer than the original approach.

I have limited this to when the backend is localhost so that we leave the serverless behavior untouched for now. We could easily extend this if a user finds this error with other backends.

Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good, but before releasing it I want to:

  • confirm with @jbusecke that nothing else puzzling has come up wrt this rabbit hole,
  • raise an upstream issue on lithops in case we're missing something important here.

@TomNicholas TomNicholas mentioned this pull request Mar 16, 2026
@jbusecke
Copy link
Collaborator Author

@TomNicholas I just looked into the failing Minimum Version Test and it seems that lithops is installed with the latest version 3.6.4?

a) That does not seems like the behavior we want from the test?
b) Worries me that the fix is somehow dependent on the python version or some other library?

@jbusecke
Copy link
Collaborator Author

Aha! now the 3.11 tests also fail.

@jbusecke jbusecke mentioned this pull request Mar 16, 2026
4 tasks
@jbusecke
Copy link
Collaborator Author

Waiting for #932 to make sure that this is limited to python 3.11, but also wondering if this is another example of #933?

@jbusecke
Copy link
Collaborator Author

jbusecke commented Mar 16, 2026

The 3.12 and 3.13 tests fails with that VirtualiZarrDatasetAccessor import error we saw last week in the office @tom.

The 3.11 min-deps test does not mention this import error, but there is something failing within lithops, so it is not just an issue of not releasing the caches IIUC.

@jbusecke
Copy link
Collaborator Author

Rerunning them now.

@TomNicholas
Copy link
Member

Try adding the @pytest.mark.flaky decorators from #934 @jbusecke

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Lithops FunctionExecutor memory leaks: atexit handler + unbounded futures list

3 participants