support max_concurrency in upload_blob and download_blob operations#420
Merged
TomAugspurger merged 2 commits intofsspec:mainfrom Aug 5, 2023
Merged
support max_concurrency in upload_blob and download_blob operations#420TomAugspurger merged 2 commits intofsspec:mainfrom
max_concurrency in upload_blob and download_blob operations#420TomAugspurger merged 2 commits intofsspec:mainfrom
Conversation
b329028 to
a7eeb89
Compare
pmrowla
commented
Aug 3, 2023
| account_name=storage.account_name, connection_string=CONN_STR | ||
| account_name=storage.account_name, | ||
| connection_string=CONN_STR, | ||
| max_concurrency=1, |
Contributor
Author
There was a problem hiding this comment.
Concurrency cannot be used for this test, otherwise storage.insert_time will be the timestamp of the first completed chunk and creation_time will be the timestamp of the finished operation (after all chunks are uploaded).
efiop
approved these changes
Aug 4, 2023
Member
|
@hayesgb @TomAugspurger Hey folks. Could you take a look, please? Just want to make sure you are fine with this. |
TomAugspurger
approved these changes
Aug 4, 2023
Contributor
TomAugspurger
left a comment
There was a problem hiding this comment.
Looks good. Just one question but feel free to merge.
Member
|
Btw, for the record, some comparisons to show how much this patch speeds things up for us treeverse/dvc-azure#54 (comment) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
max_concurrency=Nonekwarg in async fs methods that use the azure SDKupload_blobanddownload_blob(which azure then uses to parallelize async chunk uploads/downloads)AzureBlobFileSystem.max_concurrencyattribute which is used whenever method-levelmax_concurrencyis not setfsspec.asyn._get_batch_size()max_concurrencydefaults to 1 for the individual file upload/download.batch_size=...is used to parallelize uploads/downloads at the file level (in something likefs._get()orfs._put()), and no additional parallelization is done for chunks within each file.batch_sizeandmax_concurrency. i.e.fs.get(path, batch_size=4, max_concurrency=2)would download up to 4 files at a time, and up to 2 chunks at a time within each file, giving an overall concurrency of up to 8 async download coroutines being run in the loop at a time.Closes #268 (and supercedes the changes in the
concurrent_iobranch)This PR is incompatible with #329 (but there was discussion in that PR regarding changing the name of the parameter used to something other than
max_concurrencysince it conflicts with the the azure SDK parameter)