```js
_transform(row, encoding, callback) {
  if (this.rowTransform.write(row)) {
    process.nextTick(callback);
```
@jirimoravcik any chance you would know why this was here? It was added in https://github.com/apify/xlsx-stream/pull/16/changes without much explanation.. 😅 Maybe the idea was to do exactly what I'm describing in the last paragraph of the PR description?
I honestly do not remember, sorry 😅
5 scenarios with different write/read patterns to compare throughput before and after the backpressure PR. Run with `node perf-run.js all`. https://claude.ai/code/session_01Swdnx6RZrfi2p2PMUezqmk
Did some clauding and found no performance issues (and a lot of unimportant bs), I think this should be ready to merge like this.
aaaand now I can't even replicate the issue again, [redacted because i'm in a public repo and I shouldn't swear here]
## Internals documentation
A `.xlsx` file is a ZIP archive with a well-defined structure. The main class (`XLSXTransformStream`) creates this archive by first adding a few static metadata files, then streaming the actual content into a `sheet1.xml` entry. The resulting ZIP stream becomes the output of `XLSXTransformStream`.
See the diagram below for a visual overview:
Meaning you are not able to replicate the backpressure issue? So it might not even be there, or the root cause could be different? Anyway, I take it that my review is not needed right now.
A bit of context copied over from https://github.com/apify/apify-core/issues/26426
A diagram of what is happening inside this package (should I put it in README.md or somewhere?):
After a lot of testing and reading the code, I realized that backpressure is missing on the ZIP -> output part.
For Transform streams, Node.js usually checks the `_readableState.buffer` length, and if it exceeds the `highWaterMark`, it correctly applies automatic backpressure. That's why common transform streams usually don't need to deal with this and just call `callback()` after all derived chunks have been pushed. But this doesn't apply if nothing is pushed from `_transform` before the callback [Node.js stream library code]. Which is exactly our case - we call `this.push` completely out of band, whenever `zip.on('data')` fires. Thus, we need to make this fix.

This patch should, however, be performance-tested. Because what happens now is that when backpressure is applied, everything in the readable buffer (items stored by `this.push`) has to be consumed before the zip stream is resumed. So we create a kind of "waves" of backpressure/complete drain. Which maybe could be fine, but maybe could also cause a slowdown if we have slow (async) consumers and producers using this library.

So, I was also thinking about an alternative option: the pipeline drawn above is (/should be?) completely synchronous - we're just encoding data into XML and then compressing it. So maybe if we just delayed the callback by an extra tick or two, the data could flow through the whole pipeline before the callback, and Node.js native backpressure would do the work for us. But this feels quite hacky..