fix: implement proper backpressure #32

Open
mvolfik wants to merge 5 commits into `master` from `fix/backpressure`

Conversation

@mvolfik (Member) commented Mar 19, 2026

A bit of context copied over from https://github.com/apify/apify-core/issues/26426

A diagram of what is happening inside this package (should I put it in README.md or somewhere?):

```
                      [ * ] (Start)
                        |
                        | Dataset rows
                        v
                +---------------------------+      Output bytes
                |    XLSXTransformStream    |----------------------> [*] (End)
                +---------------------------+
                  |               |      ^
     Dataset rows |         Other |      | ZIP.on("data", (bytes)
                  |         files |      |   => this.push(bytes))
                  v               v      |
        +------------------+  +------------------+
        | XLSXRowTransform |  |                  |
        +------------------+  |                  |
                  |           |                  |
             Text |           |     ZIP file     |
                  v           |                  |
        +------------------+  |                  |
        |   sheet1.xml     |  |                  |
        |   ZIP entry      |  |                  |
        +------------------+  +------------------+
                  |                   ^
                  |   Is included in  |
                  +-------------------+
```

After a lot of testing and reading the code, I realized that backpressure is missing on the ZIP -> output part.

For Transform streams, Node.js checks the `_readableState.buffer` length, and if it exceeds the `highWaterMark`, it automatically applies backpressure. That's why common Transform streams usually don't need to deal with this explicitly: they just call `callback()` after all derived chunks have been pushed. But this doesn't apply if nothing is pushed from `_transform` before the callback [Node.js stream library code]. That is exactly our case: we call `this.push` completely out of band, whenever `zip.on('data')` fires. Hence this fix.

This patch should, however, be performance-tested. What happens now is that when backpressure is applied, everything in the readable buffer (items stored by `this.push`) has to be consumed before the zip stream is resumed. So we create a kind of "waves" of backpressure followed by a complete drain. That may be fine, but it could also cause a slowdown if slow (async) consumers and producers use this library.

So I was also thinking about an alternative option: the pipeline drawn above is (or should be?) completely synchronous, since we're just encoding data into XML and then compressing it. So if we just delayed the callback by an extra tick or two, the data could flow through the whole pipeline before the callback, and Node.js's native backpressure would do the work for us. But this feels quite hacky…

@mvolfik mvolfik self-assigned this Mar 19, 2026
@github-actions github-actions bot added the t-core-services Issues with this label are in the ownership of the core services team. label Mar 19, 2026

```javascript
_transform(row, encoding, callback) {
    if (this.rowTransform.write(row)) {
        process.nextTick(callback);
```
@mvolfik (Member, Author)
@jirimoravcik any chance you would know why this was here? Added in https://github.com/apify/xlsx-stream/pull/16/changes without much explanation.. 😅 Maybe the idea was to do exactly what I'm saying in the last paragraph of the PR description?

@jirimoravcik (Member)
I honestly do not remember, sorry 😅

@github-actions github-actions bot added the tested Temporary label used only programmatically for some analytics. label Mar 23, 2026
mvolfik pushed a commit that referenced this pull request Mar 23, 2026
5 scenarios with different write/read patterns to compare throughput
before and after the backpressure PR. Run with: node perf-run.js all

https://claude.ai/code/session_01Swdnx6RZrfi2p2PMUezqmk
@mvolfik (Member, Author) commented Mar 24, 2026

Did some clauding and found no performance issues (and a lot of unimportant bs), I think this should be ready to merge like this.

@mvolfik mvolfik requested review from jirimoravcik and tobice March 24, 2026 10:27
@mvolfik (Member, Author) commented Mar 24, 2026

aaaand now I can't even replicate the issue again, [redacted because i'm in a public repo and I shouldn't swear here]

@jirimoravcik (Member) left a comment

Great job, thanks


## Internals documentation

A .xlsx file is essentialy a ZIP archive with some well defined format. So the main class (`XLSXTransformStream`) creates a ZIP file, adds a few static files that don't change based on the content, and then adds a streaming `sheet1.xml` entry, into which the transformed content is written. The streamed output of the ZIP archive is then forwarded as the output of the XLSXTransformStream.
Suggested change
A .xlsx file is essentialy a ZIP archive with some well defined format. So the main class (`XLSXTransformStream`) creates a ZIP file, adds a few static files that don't change based on the content, and then adds a streaming `sheet1.xml` entry, into which the transformed content is written. The streamed output of the ZIP archive is then forwarded as the output of the XLSXTransformStream.
A `.xlsx` file is a ZIP archive with a well-defined structure. The main class (`XLSXTransformStream`) creates this archive by first adding a few static metadata files, then streaming the actual content into a `sheet1.xml` entry. The resulting ZIP stream becomes the output of XLSXTransformStream.


A .xlsx file is essentialy a ZIP archive with some well defined format. So the main class (`XLSXTransformStream`) creates a ZIP file, adds a few static files that don't change based on the content, and then adds a streaming `sheet1.xml` entry, into which the transformed content is written. The streamed output of the ZIP archive is then forwarded as the output of the XLSXTransformStream.

Image is worth a thousand words:
Suggested change
Image is worth a thousand words:
See the diagram below for a visual overview:

tobice commented Mar 25, 2026

@mvolfik

> aaaand now I can't even replicate the issue again,

Meaning you are not able to replicate the backpressure issue? So it might not even be there or the root cause could be different?

Anyway, I take it that my review is not needed right now.
