Speed up C encoder up to 100x #256
|
Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
Apple M1 Pro
* The M1 Pro result was corrected, since the previous results were affected by a bug. |
|
@DagAgren Are you interested in these improvements? |
|
I also improved decoder performance by about 14 times using the same techniques: caching cos values, caching linearTosRGB values, and unrolling loops. On M1, decoding throughput goes from 6 Mpx/s to 86 Mpx/s. This also introduces a very minor change in the output: nothing the human eye could notice, just a different binary result. This is the method I use to measure performance:
diff --git forkSrcPrefix/C/encode_stb.c forkDstPrefix/C/encode_stb.c
index 811ca00006b45eaa829bfd267904ac0d0c647884..a95c6a2ff96ee7cdaa9d1b35ef28b063161cf01d 100644
--- forkSrcPrefix/C/encode_stb.c
+++ forkDstPrefix/C/encode_stb.c
@@ -4,6 +4,7 @@
#include "stb_image.h"
#include <stdio.h>
+#include <time.h>
const char *blurHashForFile(int xComponents, int yComponents,const char *filename);
@@ -38,6 +39,14 @@ const char *blurHashForFile(int xComponents, int yComponents,const char *filenam
const char *hash = blurHashForPixels(xComponents, yComponents, width, height, data, width * 3);
+ #define TIMES 30
+ clock_t start = clock();
+ for (int i = 0; i < TIMES; i++) {
+ hash = blurHashForPixels(xComponents, yComponents, width, height, data, width * 3);
+ }
+ double time_ms = (double)(clock() - start) / CLOCKS_PER_SEC / TIMES;
+ printf("Time per %d execution: %.3f ms\n", TIMES, time_ms * 1000);
+
stbi_image_free(data);
return hash;
diff --git forkSrcPrefix/C/decode_stb.c forkDstPrefix/C/decode_stb.c
index dab164e1eaf1a7199a751a5e13f6da7099027bd2..3514f53e6f91dc41253429ea07e594893d536598 100644
--- forkSrcPrefix/C/decode_stb.c
+++ forkDstPrefix/C/decode_stb.c
@@ -3,6 +3,8 @@
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_writer.h"
+#include <time.h>
+
int main(int argc, char **argv) {
if(argc < 5) {
fprintf(stderr, "Usage: %s hash width height output_file [punch]\n", argv[0]);
@@ -34,6 +36,15 @@ int main(int argc, char **argv) {
freePixelArray(bytes);
+ #define TIMES 30
+ clock_t start = clock();
+ for (int i = 0; i < TIMES; i++) {
+ uint8_t * tmpbytes = decode(hash, width, height, punch, nChannels);
+ freePixelArray(tmpbytes);
+ }
+ double time_ms = (double)(clock() - start) / CLOCKS_PER_SEC / TIMES;
+ printf("Time per %d execution: %.3f ms\n", TIMES, time_ms * 1000);
+
fprintf(stdout, "Decoded blurhash successfully, wrote PNG file %s\n", output_file);
return 0;
} |
|
@DagAgren How can I earn your attention? |
|
@DagAgren please note that |
|
This is a breakthrough for this library. Why can't we merge it? @DagAgren ? |
|
Sorry I did not see this earlier. However, this code is intentionally written to be simple rather than performant, because it is meant as a reference implementation that can be ported to other platforms as easily as possible. It also should not need high performance: you should not run it on a full-sized image, but first scale the image down to a much smaller size, such as 32x32, and run it on that. This is mentioned in the documentation. Running it on a full-scale image is not useful, as it throws away all that detail anyway. |
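The scale-down step suggested above could look roughly like the sketch below. It is only an illustration: `downscaleNearestRGB` is a hypothetical helper, not part of the library, and nearest-neighbour sampling is chosen purely for brevity (a real pipeline would more likely use the image library's own resize routine).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nearest-neighbour downscale of an 8-bit RGB image.
 * src rows are srcBytesPerRow bytes apart; dst is tightly packed and must
 * hold dstWidth * dstHeight * 3 bytes. */
static void downscaleNearestRGB(const uint8_t *src, int srcWidth, int srcHeight,
                                size_t srcBytesPerRow,
                                uint8_t *dst, int dstWidth, int dstHeight) {
    for (int y = 0; y < dstHeight; y++) {
        /* Map the destination row to a source row. */
        int sy = (int)((int64_t)y * srcHeight / dstHeight);
        const uint8_t *srcRow = src + (size_t)sy * srcBytesPerRow;
        uint8_t *dstRow = dst + (size_t)y * (size_t)dstWidth * 3;
        for (int x = 0; x < dstWidth; x++) {
            /* Map the destination column to a source column. */
            int sx = (int)((int64_t)x * srcWidth / dstWidth);
            dstRow[x * 3 + 0] = srcRow[sx * 3 + 0];
            dstRow[x * 3 + 1] = srcRow[sx * 3 + 1];
            dstRow[x * 3 + 2] = srcRow[sx * 3 + 2];
        }
    }
}
```

After filling a `uint8_t small[32 * 32 * 3]` buffer this way, the encoder would be called on the small image, e.g. `blurHashForPixels(xComponents, yComponents, 32, 32, small, 32 * 3)`.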
Does this mean you’re rejecting any performance improvements entirely, or only the more radical ones (like 4× loop unrolling)? Regarding the suggestion to scale the image down to 32×32 — that almost eliminates any benefit from sRGB → linear conversion. Performance improvements are still measurable even at that size. I used large images only to better demonstrate the effect; the same applies to small ones. |
All changes are split into independent commits; some of them are optional.
In addition to the performance improvements, there are these changes:
- M_PI in sources, ensure it is defined in math.h.
- blurhash_encoder executable (in line with blurHashForPixels function)
- Makefile to avoid heavy encode_stb recompilation on each change.
Benchmarks are in the comment.