
gh-142183: Cache one datachunk per tstate to prevent alloc/dealloc thrashing #145789

Merged
Yhg1s merged 3 commits into python:main from Yhg1s:cache-datachunk
Mar 11, 2026
Conversation


@Yhg1s Yhg1s commented Mar 11, 2026

Cache one datachunk per tstate to prevent alloc/dealloc thrashing when repeatedly hitting the same call depth at exactly the wrong boundary.


Yhg1s commented Mar 11, 2026

Just to be clear: this is effectively a freelist of 1, and there's still an easily crafted (but much less likely in reality, I would argue) case where two (or more) stack chunks are repeatedly allocated and deallocated. That requires a much larger chain of calls, or much larger functions, so it's not as pronounced, but crafting code to hit that exact case isn't hard. It shows a ~15% penalty for being at just the wrong stack depth, compared to 35+% for the single chunk case.

I considered making the cached chunk a freelist (which would be easy since they're already a linked list), but that would mean keeping all datastack chunks of a thread alive for the entire duration of the thread, which might not be a good idea. Caching a single chunk seems like a reasonable compromise.
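The freelist-of-1 mechanism can be modeled in Python. This is only an illustrative sketch; the class, field, and function names here are made up and do not match the actual C names in CPython's thread-state code:

```python
class Chunk:
    """Stand-in for a datastack chunk (a malloc'd block in CPython)."""
    def __init__(self, size):
        self.size = size
        self.previous = None  # chunks form a linked list per thread


class ThreadState:
    """Stand-in for a tstate with a one-chunk cache."""
    def __init__(self):
        self.current = None   # top of the linked list of live chunks
        self.cached = None    # at most one freed chunk kept around
        self.allocs = 0       # count "real" allocations for illustration

    def push_chunk(self, size):
        # Reuse the cached chunk if it is big enough; otherwise allocate.
        if self.cached is not None and self.cached.size >= size:
            chunk, self.cached = self.cached, None
        else:
            chunk = Chunk(size)
            self.allocs += 1
        chunk.previous = self.current
        self.current = chunk

    def pop_chunk(self):
        chunk = self.current
        self.current = chunk.previous
        if self.cached is None:
            self.cached = chunk   # cache instead of freeing
        # else: drop it (garbage collected here; free() in C)


ts = ThreadState()
# Cross the same chunk boundary 1000 times, as the bad-case repro does.
for _ in range(1000):
    ts.push_chunk(4096)
    ts.pop_chunk()
print(ts.allocs)  # 1: only the first crossing allocates
```

Without the cache, every iteration of that loop would allocate and free a chunk; with it, only the first iteration allocates.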

Here are some benchmark results using the repro I provided in the issue, run on a not particularly quiet machine, so the results are a little noisy. 55 is the stack depth level that triggers the bad case; 56 is one level deeper (so slightly more work) and avoids it:
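The actual repro script is the one attached to the linked issue; purely as a rough illustration of its shape (not the real script), a program that repeatedly runs a call chain of a depth given on the command line could look like this:

```python
# Hypothetical stand-in for the repro from the issue, not the real script.
# At an unlucky depth, every iteration of the loop crosses a datastack
# chunk boundary: a chunk is allocated on the way down the call chain
# and freed on the way back up.
import sys


def recurse(n):
    """Call chain of depth n; returns n."""
    if n == 0:
        return 0
    return recurse(n - 1) + 1


def main():
    depth = int(sys.argv[1]) if len(sys.argv) > 1 else 55
    for _ in range(10_000):
        recurse(depth)


if __name__ == "__main__":
    main()
```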

% hyperfine --warmup 3 './base/python repro.py' './fixed/python repro.py'
Benchmark 1: ./base/python repro.py
  Time (mean ± σ):      24.1 ms ±   3.8 ms    [User: 15.9 ms, System: 7.7 ms]
  Range (min … max):    20.9 ms …  41.8 ms    112 runs

Benchmark 2: ./fixed/python repro.py
  Time (mean ± σ):      18.4 ms ±   2.0 ms    [User: 14.8 ms, System: 3.3 ms]
  Range (min … max):    16.4 ms …  28.3 ms    150 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  ./fixed/python repro.py ran
    1.31 ± 0.25 times faster than ./base/python repro.py
%  hyperfine --warmup 3 './base/python repro.py 55' './base/python repro.py 56'
Benchmark 1: ./base/python repro.py 55
  Time (mean ± σ):      21.6 ms ±   2.4 ms    [User: 14.6 ms, System: 6.7 ms]
  Range (min … max):    19.5 ms …  31.5 ms    128 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: ./base/python repro.py 56
  Time (mean ± σ):      16.7 ms ±   1.3 ms    [User: 13.7 ms, System: 2.8 ms]
  Range (min … max):    15.3 ms …  25.1 ms    165 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  ./base/python repro.py 56 ran
    1.30 ± 0.17 times faster than ./base/python repro.py 55
% hyperfine --warmup 3 './fixed/python repro.py 55' './fixed/python repro.py 56'
Benchmark 1: ./fixed/python repro.py 55
  Time (mean ± σ):      17.1 ms ±   1.5 ms    [User: 13.8 ms, System: 3.1 ms]
  Range (min … max):    15.6 ms …  24.6 ms    164 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: ./fixed/python repro.py 56
  Time (mean ± σ):      17.1 ms ±   2.0 ms    [User: 13.8 ms, System: 3.1 ms]
  Range (min … max):    15.8 ms …  26.4 ms    170 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  ./fixed/python repro.py 55 ran
    1.00 ± 0.15 times faster than ./fixed/python repro.py 56


@markshannon markshannon left a comment


Looks good. Thanks for fixing this.

Do you want to backport this to 3.14 and maybe 3.13?

Note:
Although this is a solid fix for 3.14 and 3.13, we'll probably want to use a resizable stack for 3.15+ to avoid deopts in the SAI and JIT when operating at the edge of a chunk.

@Yhg1s Yhg1s added needs backport to 3.13 bugs and security fixes needs backport to 3.14 bugs and security fixes labels Mar 11, 2026

Yhg1s commented Mar 11, 2026

I think we should backport, yes, unless @hugovk thinks it's a bad idea.

@Yhg1s Yhg1s merged commit 706fd4e into python:main Mar 11, 2026
58 checks passed
@miss-islington-app

Thanks @Yhg1s for the PR 🌮🎉. I'm working now to backport this PR to: 3.13, 3.14.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Mar 11, 2026
…loc thrashing (pythonGH-145789)

Cache one datachunk per tstate to prevent alloc/dealloc thrashing when repeatedly hitting the same call depth at exactly the wrong boundary.

---------
(cherry picked from commit 706fd4e)

Co-authored-by: T. Wouters <thomas@python.org>
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Mar 11, 2026
…loc thrashing (pythonGH-145789)

Cache one datachunk per tstate to prevent alloc/dealloc thrashing when repeatedly hitting the same call depth at exactly the wrong boundary.

---------
(cherry picked from commit 706fd4e)

Co-authored-by: T. Wouters <thomas@python.org>
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>

bedevere-app bot commented Mar 11, 2026

GH-145828 is a backport of this pull request to the 3.14 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.14 bugs and security fixes label Mar 11, 2026

bedevere-app bot commented Mar 11, 2026

GH-145829 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label Mar 11, 2026