fix: implement proper backpressure #32

Open
mvolfik wants to merge 5 commits into `master` from `fix/backpressure`

Conversation

@mvolfik (Member) commented Mar 19, 2026

A bit of context copied over from https://github.com/apify/apify-core/issues/26426

A diagram of what is happening inside this package (should I put it in README.md or somewhere?):

```
                      [ * ] (Start)
                        |
                        | Dataset rows
                        v
                +---------------------------+      Output bytes
                |    XLSXTransformStream    |----------------------> [*] (End)
                +---------------------------+
                  |               |      ^
     Dataset rows |         Other |      | ZIP.on("data", (bytes)
                  |         files |      |   => this.push(bytes))
                  v               v      |
        +------------------+  +------------------+
        | XLSXRowTransform |  |                  |
        +------------------+  |                  |
                  |           |                  |
             Text |           |     ZIP file     |
                  v           |                  |
        +------------------+  |                  |
        |   sheet1.xml     |  |                  |
        |   ZIP entry      |  |                  |
        +------------------+  +------------------+
                  |                   ^
                  |   Is included in  |
                  +-------------------+
```

After a lot of testing and reading the code, I realized that backpressure is missing on the ZIP -> output part.

For Transform streams, Node.js checks the `_readableState.buffer` length, and if it exceeds the `highWaterMark`, it automatically applies backpressure. That's why common Transform streams usually don't need to deal with this explicitly: they just call `callback()` after all derived chunks have been pushed. But this doesn't apply if nothing is pushed from `_transform` before the callback [Node.js stream library code]. That is exactly our case: we call `this.push` completely out of band, whenever `zip.on('data')` fires. Hence this fix.

This patch should, however, be performance-tested. What happens now is that when backpressure is applied, everything in the readable buffer (items stored by `this.push`) has to be consumed before the zip stream is resumed. So we create a kind of "waves" of backpressure followed by a complete drain. That may be fine, but it could also cause a slowdown if slow (async) consumers and producers use this library.

So I was also thinking about an alternative option: the pipeline drawn above is (or should be?) completely synchronous, since we're just encoding data into XML and then compressing it. So if we just delayed the callback by an extra tick or two, the data could flow through the whole pipeline before the callback, and Node.js's native backpressure would do the work for us. But this feels quite hacky…

@mvolfik mvolfik self-assigned this Mar 19, 2026
@github-actions github-actions bot added the t-core-services Issues with this label are in the ownership of the core services team. label Mar 19, 2026

```javascript
_transform(row, encoding, callback) {
    if (this.rowTransform.write(row)) {
        process.nextTick(callback);
```
@mvolfik (Member, Author)
@jirimoravcik any chance you would know why this was here? Added in https://github.com/apify/xlsx-stream/pull/16/changes without much explanation.. 😅 Maybe the idea was to do exactly what I'm saying in the last paragraph of the PR description?

@jirimoravcik (Member)
I honestly do not remember, sorry 😅

@github-actions github-actions bot added the tested Temporary label used only programmatically for some analytics. label Mar 23, 2026
mvolfik pushed a commit that referenced this pull request Mar 23, 2026
5 scenarios with different write/read patterns to compare throughput
before and after the backpressure PR. Run with: node perf-run.js all

https://claude.ai/code/session_01Swdnx6RZrfi2p2PMUezqmk
@mvolfik (Member, Author) commented Mar 24, 2026

Did some clauding and found no performance issues (and a lot of unimportant bs), I think this should be ready to merge like this.

@mvolfik mvolfik requested review from jirimoravcik and tobice March 24, 2026 10:27
@mvolfik (Member, Author) commented Mar 24, 2026

aaaand now I can't even replicate the issue again, [redacted because i'm in a public repo and I shouldn't swear here]

@jirimoravcik (Member) left a comment

Great job, thanks


## Internals documentation

A .xlsx file is essentialy a ZIP archive with some well defined format. So the main class (`XLSXTransformStream`) creates a ZIP file, adds a few static files that don't change based on the content, and then adds a streaming `sheet1.xml` entry, into which the transformed content is written. The streamed output of the ZIP archive is then forwarded as the output of the XLSXTransformStream.
Suggested change
A .xlsx file is essentialy a ZIP archive with some well defined format. So the main class (`XLSXTransformStream`) creates a ZIP file, adds a few static files that don't change based on the content, and then adds a streaming `sheet1.xml` entry, into which the transformed content is written. The streamed output of the ZIP archive is then forwarded as the output of the XLSXTransformStream.
A `.xlsx` file is a ZIP archive with a well-defined structure. The main class (`XLSXTransformStream`) creates this archive by first adding a few static metadata files, then streaming the actual content into a `sheet1.xml` entry. The resulting ZIP stream becomes the output of XLSXTransformStream.


A .xlsx file is essentialy a ZIP archive with some well defined format. So the main class (`XLSXTransformStream`) creates a ZIP file, adds a few static files that don't change based on the content, and then adds a streaming `sheet1.xml` entry, into which the transformed content is written. The streamed output of the ZIP archive is then forwarded as the output of the XLSXTransformStream.

Image is worth a thousand words:
Suggested change
Image is worth a thousand words:
See the diagram below for a visual overview:

tobice commented Mar 25, 2026

@mvolfik

> aaaand now I can't even replicate the issue again,

Meaning you are not able to replicate the backpressure issue? So it might not even be there or the root cause could be different?

Anyway, I take it that my review is not needed right now.
