merge upstream caches via os.scandir + os.link #137

Open
LarsMichelsen wants to merge 1 commit into bazel-contrib:main from LarsMichelsen:py-hardlink-merge

@LarsMichelsen

Walking each upstream cache with os.walk and copying every entry via shutil.copy is slow on deep dependency graphs. It also duplicates upstream content in every downstream action, which inflates the exec root and saturates disk write bandwidth.

This change reduces the wall time of some targets with deeply nested dependencies by 60%, although the gain varies greatly from target to target.

Replace the loop with a recursive os.scandir + os.link traversal.

  • os.scandir iterates directory entries in C and usually avoids a separate stat call per entry, which makes it much faster than the previous os.walk loop on deep trees.
  • Hardlinks share inodes, so each downstream cache adds essentially zero unique disk blocks rather than re-copying transitive content.
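
For readers who want to see the shape of such a traversal, here is a minimal sketch. It is not the code from this PR: the function name `link_tree`, the choice to keep the first file when two upstream caches provide the same path, and the copy fallback for cross-device links are all assumptions made for illustration.

```python
import errno
import os
import shutil


def link_tree(src: str, dst: str) -> None:
    """Mirror src into dst, hardlinking files instead of copying them.

    Hypothetical helper: name and conflict policy are illustrative only.
    """
    os.makedirs(dst, exist_ok=True)
    with os.scandir(src) as entries:
        for entry in entries:
            target = os.path.join(dst, entry.name)
            # DirEntry.is_dir() normally reuses the type reported by the
            # directory listing, so no extra stat call is needed here.
            if entry.is_dir(follow_symlinks=False):
                link_tree(entry.path, target)
                continue
            try:
                # A hardlink adds a directory entry for the same inode.
                os.link(entry.path, target)
            except FileExistsError:
                # Assumption: an earlier upstream cache already supplied
                # this path, so the first one wins.
                pass
            except OSError as exc:
                if exc.errno != errno.EXDEV:
                    raise
                # Assumption: hardlinks cannot cross filesystems, so fall
                # back to a regular copy in that case.
                shutil.copy2(entry.path, target)
```

Calling `link_tree(upstream_cache_dir, merged_cache_dir)` once per upstream cache would merge them into one directory. Because the linked files share inodes with the originals, anything that later writes to a merged file in place would also modify the upstream copy, so this approach fits best when cache files are treated as read-only or replaced atomically.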

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d3734a933a

Comment thread mypy/private/mypy_runner.py Outdated