Speedup C encoder up to 100x by homm · Pull Request #256 · woltapp/blurhash

homm · 2024-09-25T08:57:07Z

The changes are split into independent commits; some of them are optional.

In addition to the performance improvements, there are other changes:

- Do not define M_PI in the sources; ensure it is defined by math.h.
- Fixed the maximum number of components for the blurhash_encoder executable (now in line with the blurHashForPixels function).
- Improved the Makefile to avoid the heavy encode_stb recompilation on each change.

Benchmarks are in the comment below.

homm · 2024-10-03T20:40:54Z

~~I've also implemented SSE and NEON optimizations in a separate branch.~~ The last optimization, unrolling the loop in multiplyBasisFunction, actually works better, since it lets any compiler autovectorize the code effectively. Here are benchmarks for a 2000 × 1334 JPEG image on different systems:

Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz

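| Optimization | GCC 13.2.1 (6 4) | GCC 13.2.1 (9 9) | Clang 17.0.6 (6 4) | Clang 17.0.6 (9 9) |
|---|---|---|---|---|
| Master | 3181 ms | 11844 ms | 3154 ms | 11124 ms |
| sRGBToLinear_cache | 381 ms | 1507 ms | 451 ms | 1633 ms |
| cosX cache | 82 ms | 339 ms | 88 ms | 270 ms |
| Single pass | 58 ms | 177 ms | 62 ms | 207 ms |
| ~~SSE~~ (obsolete) | 39 ms | 114 ms | 42 ms | 144 ms |
| Unroll 4x | 30 ms | 80 ms | 32 ms | 85 ms |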

Apple M1 Pro

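| Optimization | GCC 13.2.1 (6 4) | GCC 13.2.1 (9 9) | Clang 17.0.6 (6 4) | Clang 17.0.6 (9 9) | Clang 14.0.3 (6 4) | Clang 14.0.3 (9 9) |
|---|---|---|---|---|---|---|
| Master | 1177 ms | 4076 ms | 1156 ms | 4005 ms | 1268 ms | 4302 ms |
| sRGBToLinear_cache | 212 ms | 826 ms | 216 ms | 839 ms | 186 ms | 653 ms |
| cosX cache | 44 ms | 150 ms | 80 ms | 271 ms | 81 ms | 271 ms |
| Single pass | 20 ms | 62 ms | 32 ms | 57 ms | 29 ms | 70 ms |
| ~~NEON~~ (obsolete) | 27 ms | 87 ms | 25 ms | 80 ms | 25 ms | 80 ms |
| Unroll 4x | 16 ms | 49 ms | 15 ms | 43 ms | 15 ms | 42 ms |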

* The M1 Pro results were corrected, since the previous results were affected by a bug.
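For readers unfamiliar with why manual unrolling helps here, this is a minimal sketch of the general idea, not the code from this PR. The function name `weighted_sum_unrolled4` and its arguments are hypothetical. Without flags such as `-ffast-math`, a compiler may not reorder a floating-point reduction, so a single-accumulator loop often stays scalar; four independent accumulators give it something it can map onto SIMD lanes.

```c
#include <stddef.h>

/* Hypothetical helper (not from the PR): sum of basis[i] * values[i],
   unrolled 4x with independent accumulators so that the compiler is
   free to vectorize the main loop without reassociating float adds. */
float weighted_sum_unrolled4(const float *basis, const float *values, size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;

    /* Main loop: four independent multiply-adds per iteration. */
    for (; i + 4 <= n; i += 4) {
        acc0 += basis[i]     * values[i];
        acc1 += basis[i + 1] * values[i + 1];
        acc2 += basis[i + 2] * values[i + 2];
        acc3 += basis[i + 3] * values[i + 3];
    }

    float sum = (acc0 + acc1) + (acc2 + acc3);

    /* Tail loop: the remaining 0 to 3 elements. */
    for (; i < n; i++) {
        sum += basis[i] * values[i];
    }
    return sum;
}
```

Note that the summation order differs from a plain loop, so results can differ in the last bits for general inputs.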

homm · 2024-10-11T12:23:28Z

@DagAgren Are you interested in these improvements?

homm · 2024-10-24T09:02:44Z

I also improved decoder performance about 14 times using the same techniques: caching cos values, caching linearTosRGB values, and unrolling loops. Decoding throughput on M1 goes from 6 Mpx/s to 86 Mpx/s.

Before:

$ touch decode.c && make blurhash_decoder && ./blurhash_decoder "W7E-z7oyM{8xM{wKwdMepHrE%LV[OVV@BBS\$r@NaR7OrRQNaMKXm" 640 480 _out.png
Time per 30 execution: 49.532 ms

After:

$ touch decode.c && make blurhash_decoder && ./blurhash_decoder "W7E-z7oyM{8xM{wKwdMepHrE%LV[OVV@BBS\$r@NaR7OrRQNaMKXm" 640 480 _out.png
Time per 30 execution: 3.573 ms

This also introduces a very minor change in the output. Nothing a human eye could notice, just different binary output.

This is the method I use to measure performance:

diff --git forkSrcPrefix/C/encode_stb.c forkDstPrefix/C/encode_stb.c
index 811ca00006b45eaa829bfd267904ac0d0c647884..a95c6a2ff96ee7cdaa9d1b35ef28b063161cf01d 100644
--- forkSrcPrefix/C/encode_stb.c
+++ forkDstPrefix/C/encode_stb.c
@@ -4,6 +4,7 @@
 #include "stb_image.h"
 
 #include <stdio.h>
+#include <time.h>
 
 const char *blurHashForFile(int xComponents, int yComponents,const char *filename);
 
@@ -38,6 +39,14 @@ const char *blurHashForFile(int xComponents, int yComponents,const char *filenam
 
 	const char *hash = blurHashForPixels(xComponents, yComponents, width, height, data, width * 3);
 
+	#define TIMES 30
+	clock_t start = clock();
+    for (int i = 0; i < TIMES; i++) {
+        hash = blurHashForPixels(xComponents, yComponents, width, height, data, width * 3);
+    }
+    double time_ms = (double)(clock() - start) / CLOCKS_PER_SEC / TIMES;
+    printf("Time per %d execution: %.3f ms\n", TIMES, time_ms * 1000);
+
 	stbi_image_free(data);
 
 	return hash;
diff --git forkSrcPrefix/C/decode_stb.c forkDstPrefix/C/decode_stb.c
index dab164e1eaf1a7199a751a5e13f6da7099027bd2..3514f53e6f91dc41253429ea07e594893d536598 100644
--- forkSrcPrefix/C/decode_stb.c
+++ forkDstPrefix/C/decode_stb.c
@@ -3,6 +3,8 @@
 #define STB_IMAGE_WRITE_IMPLEMENTATION
 #include "stb_writer.h"
 
+#include <time.h>
+
 int main(int argc, char **argv) {
 	if(argc < 5) {
 		fprintf(stderr, "Usage: %s hash width height output_file [punch]\n", argv[0]);
@@ -34,6 +36,15 @@ int main(int argc, char **argv) {
 
 	freePixelArray(bytes);
 
+	#define TIMES 30
+	clock_t start = clock();
+    for (int i = 0; i < TIMES; i++) {
+    	uint8_t * tmpbytes = decode(hash, width, height, punch, nChannels);
+    	freePixelArray(tmpbytes);
+    }
+    double time_ms = (double)(clock() - start) / CLOCKS_PER_SEC / TIMES;
+    printf("Time per %d execution: %.3f ms\n", TIMES, time_ms * 1000);
+
 	fprintf(stdout, "Decoded blurhash successfully, wrote PNG file %s\n", output_file);
 	return 0;
 }

homm · 2024-10-30T18:54:37Z

@DagAgren How can I earn your attention?

vellnes · 2024-12-04T15:39:44Z

@DagAgren please take note of this.
We would be very grateful for this optimization of the algorithm.

jonybekov · 2025-10-28T22:54:49Z

This is a breakthrough for this library. Why can't we merge it? @DagAgren ?

DagAgren · 2025-10-29T16:18:48Z

Sorry, I did not see this earlier. However, this code is intentionally written to be simple rather than performant, because it is meant as a reference implementation that can be ported to other platforms as easily as possible.

Also, it should not need high performance. You should not run it on a full-sized image, but instead first scale the image down to a much smaller size, such as 32x32, and run it on that. This is mentioned in the documentation. Running it on a full-scale image is not useful, as it throws away all that detail anyway.
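As a rough illustration of the downscale-first workflow described here, a minimal box-filter sketch might look like this. The helper `downscale_box_rgb` is hypothetical and not part of the library; it assumes a tightly packed 8-bit RGB buffer and averages the raw sRGB bytes, which is simple but not gamma-correct.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: box-filter downscale of a tightly packed RGB8 image
   from sw x sh to dw x dh. Each destination pixel is the rounded average
   of the source pixels that fall into its box. */
void downscale_box_rgb(const uint8_t *src, int sw, int sh,
                       uint8_t *dst, int dw, int dh) {
    for (int y = 0; y < dh; y++) {
        int y0 = y * sh / dh;
        int y1 = (y + 1) * sh / dh;
        if (y1 <= y0) y1 = y0 + 1;  /* keep at least one source row */
        for (int x = 0; x < dw; x++) {
            int x0 = x * sw / dw;
            int x1 = (x + 1) * sw / dw;
            if (x1 <= x0) x1 = x0 + 1;  /* keep at least one source column */

            unsigned long sum[3] = {0, 0, 0};
            for (int sy = y0; sy < y1; sy++) {
                for (int sx = x0; sx < x1; sx++) {
                    const uint8_t *p = src + ((size_t)sy * sw + sx) * 3;
                    sum[0] += p[0];
                    sum[1] += p[1];
                    sum[2] += p[2];
                }
            }

            unsigned long count = (unsigned long)(y1 - y0) * (unsigned long)(x1 - x0);
            uint8_t *q = dst + ((size_t)y * dw + x) * 3;
            for (int c = 0; c < 3; c++) {
                q[c] = (uint8_t)((sum[c] + count / 2) / count);
            }
        }
    }
}
```

The result would then be passed to the encoder in the same way as in the benchmark diff above, for example `blurHashForPixels(xComponents, yComponents, dw, dh, dst, dw * 3)`.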

homm · 2025-10-30T09:55:47Z

> However, this code is written intentionally to be simple rather than performant

Does this mean you’re rejecting any performance improvements entirely, or only the more radical ones (like 4× loop unrolling)?

Regarding the suggestion to scale the image down to 32×32 — that almost eliminates any benefit from sRGB → linear conversion.

Performance improvements are still measurable even at that size. I used large images only to better demonstrate the effect; the same applies to small ones.

homm added 8 commits September 25, 2024 10:56

Define math consts

9c138a9

Fix number of arguments in blurhash_encoder

3152f57

Show main warnings like unused variables

b027b16

Build object files separate for compilation speedup

7fe900e

Use sRGBToLinear_cache (4.5x speedup)

6af77c9

cosX cache (5.6x speedup)

d936afb

Prepare cosX && cosY once for all passes

2e19ea7

Calculate factors in one call (up to 1.6x speedup)

d3d26c1

homm mentioned this pull request Sep 25, 2024

Optimization encoding in 124 times woltapp/blurhash-python#25

Open

unroll multiplyBasisFunction loop (2.5x speedup)

52d4a62

homm changed the title ~~Speedup C encoder by factor of 40~~ Speedup C encoder up to 100x Oct 11, 2024

homm added 5 commits October 18, 2024 12:43

Assign sRGBToLinear_cache after population to avoid races

89a2524

decoder: cosf is about 17% faster

3ab38fe

decoder: cache cos (3.2x faster)

6f02d6e

decode: unroll inner loop (20% faster)

0f2d8c9

decode: Cache linearTosRGB (2.75x speedup)

d2a09cd

homm wants to merge 14 commits into woltapp:master from homm:optimization
