Smaller stack usage for SHA-1, SHA-256 and SHA-512. by MarekKnapek · Pull Request #709 · libtom/libtomcrypt

MarekKnapek · 2025-11-29T10:05:49Z

Checklist

documentation is added or updated
tests are added or updated

sjaeckel

That looks interesting. Thanks for the next PR :)

At first glance it seems like we'd be trading space for time, meaning that execution should be slower with the patch applied.
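One way to picture that space/time trade-off: a SHA-256 compression function can either expand the whole 64-word message schedule up front (256 bytes of stack) or keep only the most recent 16 words in a ring buffer (64 bytes) and compute each new word when it is needed. The sketch below is a standalone illustration of that idea, not the code from this PR. All function names are made up for the example, and it only compares the schedule words instead of running a full compression.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t rotr32(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }
/* SHA-256 small sigma functions used by the message schedule (FIPS 180-4). */
static uint32_t gamma0(uint32_t x) { return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3); }
static uint32_t gamma1(uint32_t x) { return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10); }

/* Variant 1: expand all 64 words first (uint32_t W[64] = 256 bytes of stack). */
static void schedule_full(const uint32_t in[16], uint32_t out[64])
{
   uint32_t W[64];
   int i;
   for (i = 0; i < 16; i++) W[i] = in[i];
   for (i = 16; i < 64; i++) {
      W[i] = gamma1(W[i - 2]) + W[i - 7] + gamma0(W[i - 15]) + W[i - 16];
   }
   for (i = 0; i < 64; i++) out[i] = W[i];
}

/* Variant 2: 16-word ring buffer (64 bytes); before the update,
 * W[i & 15] still holds word i-16, so it can be updated in place. */
static void schedule_rolling(const uint32_t in[16], uint32_t out[64])
{
   uint32_t W[16];
   int i;
   for (i = 0; i < 16; i++) out[i] = W[i] = in[i];
   for (i = 16; i < 64; i++) {
      W[i & 15] += gamma1(W[(i - 2) & 15]) + W[(i - 7) & 15] + gamma0(W[(i - 15) & 15]);
      out[i] = W[i & 15];
   }
}

/* Returns 1 if both variants yield the same 64 schedule words for an arbitrary test block. */
static int schedules_match(void)
{
   uint32_t in[16], a[64], b[64];
   int i;
   for (i = 0; i < 16; i++) in[i] = 0x01234567u * (uint32_t)(i + 1);
   schedule_full(in, a);
   schedule_rolling(in, b);
   for (i = 0; i < 64; i++) {
      if (a[i] != b[i]) return 0;
   }
   return 1;
}

/* Aborts if the two variants ever diverge. */
static void self_check(void) { assert(schedules_match() == 1); }
```

The ring-buffer variant does the same arithmetic but adds index masking and interleaves the expansion with whatever consumes the words, which is why a slowdown would be the natural expectation.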

So I modified the timing demo a bit to show something relevant, and the before looks as follows:

sha512              : Process at    39
sha512-256          : Process at    39
sha384              : Process at    39
sha512-224          : Process at    39
sha1                : Process at    61
sha256              : Process at   122
sha224              : Process at   122

vs. after this patch applied:

sha512              : Process at    39
sha384              : Process at    40
sha512-256          : Process at    40
sha512-224          : Process at    40
sha1                : Process at    68
sha224              : Process at   106
sha256              : Process at   106

sha1 really got worse, sha512-based stayed more or less the same (maybe a little bit slower), but sha256-based got significantly better performance!?

Not sure what to do with sha1, maybe enable this patch via a new LTC_SMALL_STACK option?
The other two I'd simply take unconditionally.

What do you think?

MarekKnapek · 2025-12-01T17:50:52Z

I think my next PR will be about x86 (and amd64) specific intrinsics. Making the SHA-1, SHA-256 and SHA-512 much, much faster.

How do I run these benchmarks myself? I could play with the code a bit more, maybe adding an if(i >= 16){ Wi(i) } condition into the loops.
Also what configuration option did you use? I mean with LTC_SMALL_CODE or without?
And how do I enable/disable this option at compile time? So far I manually edited some global header file to enable/disable this option. But I believe there might be some more kosher way of toggling this option.

sjaeckel · 2025-12-02T08:33:25Z

> How do I run these benchmarks myself? I could play with the code a bit more, maybe adding an if(i >= 16){ Wi(i) } condition into the loops.

That's the timing demo in demos/timing.c. I've just pushed an update to it and ran it as ./timing hash sha, then manually removed the sha3 parts when pasting its output here.

> And how do I enable/disable this option at compile time?

That depends on how you build the library.

I usually simply run make, so it's a matter of make -j$(($(nproc)*2+1)) timing CFLAGS="-DLTC_SMALL_CODE".

If you use CMake (and build in a folder inside the ltc folder) it'd be cmake -DLTC_CFLAGS="-DLTC_SMALL_CODE" -DCMAKE_BUILD_TYPE=Release -DBUILD_USABLE_DEMOS=On .. && make -j$(($(nproc)*2+1)), then run ./demos/ltc-timing hash sha.
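For context, passing -DLTC_SMALL_CODE via CFLAGS simply defines a preprocessor macro that the sources test with #ifdef. The following sketch shows that mechanism in isolation. The enum and function names are hypothetical and not part of libtomcrypt; only the macro names LTC_SMALL_CODE and LTC_SMALL_STACK (the latter is proposed above and added later in this PR) come from this thread.

```c
#include <assert.h>

/* Hypothetical labels for the four build configurations discussed in this PR. */
enum build_config {
   CFG_DEFAULT,
   CFG_SMALL_CODE,
   CFG_SMALL_STACK,
   CFG_SMALL_STACK_AND_CODE
};

/* Reports which combination of -D flags this translation unit was compiled with. */
static enum build_config active_config(void)
{
#if defined(LTC_SMALL_STACK) && defined(LTC_SMALL_CODE)
   return CFG_SMALL_STACK_AND_CODE;
#elif defined(LTC_SMALL_STACK)
   return CFG_SMALL_STACK;
#elif defined(LTC_SMALL_CODE)
   return CFG_SMALL_CODE;
#else
   return CFG_DEFAULT;
#endif
}

/* Human-readable name for a configuration value. */
static const char *config_name(enum build_config cfg)
{
   static const char *const names[] = {
      "default", "SMALL_CODE", "SMALL_STACK", "SMALL_STACK + SMALL_CODE"
   };
   assert((int)cfg >= 0 && (int)cfg <= 3);
   return names[cfg];
}
```

Compiled without any extra flags, active_config() returns CFG_DEFAULT; compiling the same file with -DLTC_SMALL_CODE makes it return CFG_SMALL_CODE, with no header edits needed.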

> Also what configuration option did you use? I mean with LTC_SMALL_CODE or without?

Those previous tests were done with the standard config. With LTC_SMALL_CODE enabled they look like this:

Before the patch:

sha512              : Process at    39
sha384              : Process at    39
sha512-256          : Process at    40
sha512-224          : Process at    40
sha1                : Process at    84
sha256              : Process at   108
sha224              : Process at   108

After the patch:

sha512-224          : Process at    45
sha512              : Process at    45
sha512-256          : Process at    45
sha384              : Process at    45
sha1                : Process at    77
sha256              : Process at   132
sha224              : Process at   134

So it seems like your patch improves the performance in the default case (LTC_SMALL_CODE undefined) for "sha256 based", but deteriorates for "sha1".

In the case LTC_SMALL_CODE is defined it improves "sha1", but deteriorates the two others.

FYI:

$ head /proc/cpuinfo | grep 'model name'
model name      : AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics

> I think my next PR will be about x86 (and amd64) specific intrinsics

OK, that sounds nice. Are you also thinking about adding SHA-NI support? If so, you could have a look at #557 to see how we did it for AES-NI.

MarekKnapek · 2025-12-03T15:01:26Z

My performance measurements are different. Maybe it depends on processor cache size, branch prediction buffer size and many other things.

Before:

sha1                : Process at    98
sha224              : Process at   188
sha256              : Process at   188
sha384              : Process at    58
sha512              : Process at    58
sha512-224          : Process at    58
sha512-256          : Process at    58

After:

sha1                : Process at   100
sha224              : Process at   185
sha256              : Process at   185
sha384              : Process at    67
sha512              : Process at    67
sha512-224          : Process at    67
sha512-256          : Process at    67

My command line was:
make && make test && make docs && make timing && ./test && ./helper.pl -a && ./timing hash sha

Another measurement, this time with LTC_SMALL_CODE.

Before:

sha1                : Process at   135
sha224              : Process at   190
sha256              : Process at   190
sha384              : Process at    62
sha512              : Process at    62
sha512-224          : Process at    62
sha512-256          : Process at    62

After:

sha1                : Process at   110
sha224              : Process at   229
sha256              : Process at   229
sha384              : Process at    73
sha512              : Process at    73
sha512-224          : Process at    73
sha512-256          : Process at    73

Here I was able to improve SHA-1 after speed from 110 to 121 by changing the first loop to:

    for (i = 0; i < 20; ) {
       /* only expand the message schedule from word 16 onwards */
       if (i >= 16) { Wi(i); }
       FF0(a,b,c,d,e,i++); t = e; e = d; d = c; c = b; b = a; a = t;
    }

But it is still slower than before.
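To make the two loop shapes concrete, here is a small standalone sketch comparing a split loop (rounds 0..15 without expansion, rounds 16..19 with it) against a single loop that branches on i >= 16. It is an illustration only: expand_word is a hypothetical stand-in for the Wi(i) macro, assuming a 16-word ring buffer and the standard SHA-1 schedule recurrence, and the accumulator is a toy consumer rather than the real FF0 round.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t rol32(uint32_t x, unsigned n) { return (x << n) | (x >> (32 - n)); }

/* Hypothetical stand-in for Wi(i): expand SHA-1 schedule word i in place
 * inside a 16-word ring buffer (W[i & 15] holds word i-16 before the update). */
static void expand_word(uint32_t W[16], int i)
{
   W[i & 15] = rol32(W[(i - 3) & 15] ^ W[(i - 8) & 15] ^ W[(i - 14) & 15] ^ W[i & 15], 1);
}

/* Variant 1: two loops, no branch inside the loop body. */
static uint32_t fold_split(const uint32_t in[16])
{
   uint32_t W[16], acc = 0;
   int i;
   for (i = 0; i < 16; i++) W[i] = in[i];
   for (i = 0; i < 16; i++) acc = rol32(acc, 5) + W[i & 15];
   for (i = 16; i < 20; i++) { expand_word(W, i); acc = rol32(acc, 5) + W[i & 15]; }
   return acc;
}

/* Variant 2: one loop with a branch, as in the snippet above. */
static uint32_t fold_branch(const uint32_t in[16])
{
   uint32_t W[16], acc = 0;
   int i;
   for (i = 0; i < 16; i++) W[i] = in[i];
   for (i = 0; i < 20; i++) {
      if (i >= 16) { expand_word(W, i); }
      acc = rol32(acc, 5) + W[i & 15];
   }
   return acc;
}

/* Returns 1 if both loop shapes compute the same value for an arbitrary test block. */
static int loops_agree(void)
{
   uint32_t in[16];
   int i;
   for (i = 0; i < 16; i++) in[i] = 0x9E3779B9u * (uint32_t)(i + 1);
   return fold_split(in) == fold_branch(in);
}

/* Aborts if the two loop shapes ever disagree. */
static void self_check(void) { assert(loops_agree() == 1); }
```

Both shapes are functionally equivalent, so any timing difference between them comes from code generation and branch prediction rather than from the amount of work done.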

sjaeckel · 2025-12-03T16:08:05Z

> My performance measurements are different.

For sure, since you most likely have a different CPU. But the relative differences between the algorithm classes are comparable, so my statement from above:

> [...] your patch improves the performance in the default case (LTC_SMALL_CODE undefined) for "sha256 based", but deteriorates for "sha1".
>
> In the case LTC_SMALL_CODE is defined it improves "sha1", but deteriorates the two others.

is thereby validated.

> Maybe it depends on processor cache size, branch prediction buffer size and many other things.

Absolutely.

> Here I was able to improve SHA-1 after speed from 110 to 121 by changing the first loop to:

FYI: lower is faster. The number shown is "the number of CPU cycles per iteration", so by changing it from 110 to 121 you made it 10% slower :-D
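The arithmetic behind that remark can be sketched as follows. percent_change is a hypothetical helper written for this example, not something the timing demo provides.

```c
#include <assert.h>

/* Relative change of a cycle count in percent.
 * Positive means more cycles (slower), negative means fewer cycles (faster). */
static double percent_change(double before, double after)
{
   assert(before > 0.0);
   return (after - before) / before * 100.0;
}
```

For the SHA-1 numbers quoted above, percent_change(110.0, 121.0) comes out at roughly +10, i.e. about 10% more cycles.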

> My command line was: make && make test && make docs && make timing && ./test && ./helper.pl -a && ./timing hash sha

No need to run all these, especially not make test since the timing demo already runs the self-tests of the hash algorithms and would exit with an error if it failed.

To speed up your local development cycle I'd suggest running make -j$(($(nproc)*2+1)) timing && ./timing hash sha.
You most likely also have a multi-core CPU, so you can take advantage of that and run parallel builds via the -j option.

I never run ./helper.pl -a manually. Instead you can execute make install_hooks once, which installs a Git pre-commit hook that checks that it succeeds before each commit.

sjaeckel

LGTM.

@MarekKnapek are you also fine with it being merged?

MarekKnapek · 2026-04-09T11:22:20Z

Oh, this slipped my mind. I thought this change made it worse despite using fewer stack based variables. It possibly trades cache speed or cache usage against CPU speed and CPU prediction logic. Which is better depends on the CPU model and on the data usage and CPU usage of the app using this library.

I don't really know. If you like it, take it. If not, then don't.

Also, increasing the number of configuration options might not be desirable, since it increases the testing, validation and correctness effort.

In the future I might also add x86 AES, x86 SHA-1/SHA-256 and x86 SHA-512 specific implementations. That would be even better than this. (Already done for my own project.)

sjaeckel · 2026-04-09T15:03:34Z

> I thought this change made it worse despite using fewer stack based variables.

Ah nope, this PR made things better!

Thanks for that comment btw., it made me check the numbers again and pointed to a minor performance regression in SHA256, which led to the following patch.

diff --git a/src/hashes/sha2/sha256.c b/src/hashes/sha2/sha256.c
index 70c132afe675..ffd8d6f8b623 100644
--- a/src/hashes/sha2/sha256.c
+++ b/src/hashes/sha2/sha256.c
@@ -116,2 +116,3 @@ static int s_sha256_compress(hash_state * md, const unsigned char *buf)
 
+#ifdef LTC_SMALL_STACK_SHA256
      for (i = 0; i < 16; ++i) {
@@ -127,2 +128,9 @@ static int s_sha256_compress(hash_state * md, const unsigned char *buf)
      }
+#else
+     for (i = 0; i < 64; ++i) {
+         RND(S[0],S[1],S[2],S[3],S[4],S[5],S[6],S[7],i);
+         t = S[7]; S[7] = S[6]; S[6] = S[5]; S[5] = S[4];
+         S[4] = S[3]; S[3] = S[2]; S[2] = S[1]; S[1] = S[0]; S[0] = t;
+     }
+#endif /* LTC_SMALL_STACK_SHA256 */
 #else

It seems like it makes a difference whether the loop is split up or executed in one go ...

The following table contains the timing numbers on my machine of the four config options of the develop branch vs. this PR using gcc 15.2.1.

| | develop | pr709 |
| --- | --- | --- |
| default | `sha512-based : Process at 40`<br>`sha1 : Process at 63`<br>`sha256-based : Process at 127` | `sha512-based : Process at 41`<br>`sha1 : Process at 63`<br>`sha256-based : Process at 111` |
| SMALL_CODE | `sha512-based : Process at 42`<br>`sha1 : Process at 89`<br>`sha256-based : Process at 113` | `sha512-based : Process at 39`<br>`sha1 : Process at 77`<br>`sha256-based : Process at 113` |
| SMALL_STACK | | `sha512-based : Process at 42`<br>`sha1 : Process at 72`<br>`sha256-based : Process at 111` |
| SMALL_STACK + SMALL_CODE | | `sha512-based : Process at 47`<br>`sha1 : Process at 78`<br>`sha256-based : Process at 138` |

The numbers using clang 21.1.8 differ quite a bit... some numbers are surprising, but OK ...

| | develop | pr709 |
| --- | --- | --- |
| default | `sha512-based : Process at 41`<br>`sha1 : Process at 91`<br>`sha256-based : Process at 143` | `sha512-based : Process at 43`<br>`sha1 : Process at 91`<br>`sha256-based : Process at 106` |
| SMALL_CODE | `sha512-based : Process at 46`<br>`sha1 : Process at 91`<br>`sha256-based : Process at 143` | `sha512-based : Process at 46`<br>`sha1 : Process at 83`<br>`sha256-based : Process at 144` |
| SMALL_STACK | | `sha512-based : Process at 38`<br>`sha1 : Process at 64`<br>`sha256-based : Process at 107` |
| SMALL_STACK + SMALL_CODE | | `sha512-based : Process at 44`<br>`sha1 : Process at 84`<br>`sha256-based : Process at 140` |

All in all this is a good improvement I'd say, since it improved the performance of SHA256 in the default configuration for both compilers tested.

> In the future I might also add x86 AES, x86 SHA-1/SHA-256 and x86 SHA-512 specific implementations. That would be even better than this. (Already done for my own project.)

That'd be awesome :) Looking forward to it!

sjaeckel · 2026-04-09T15:06:55Z

> In the future I might also add x86 AES, x86 SHA-1/SHA-256 and x86 SHA-512 specific implementations. That would be even better than this. (Already done for my own project.)

As already mentioned, #557 added AES-NI and #714 shows another good example of what such an integration could look like!

Thanks again for this PR and I'm looking forward to your next one :)

sjaeckel reviewed Dec 1, 2025

sjaeckel force-pushed the sha-stack branch from a99f450 to 2a5c99d on December 2, 2025 08:25

MarekKnapek and others added 2 commits April 9, 2026 08:29

Smaller stack usage for SHA-1, SHA-256 and SHA-512.

17cbd2b

Slightly improve timing demo.

a9cd3cd

* Add the option to only run for a subset of algos. * Improve `hash` to show something meaningful. Signed-off-by: Steffen Jaeckel <s@jaeckel.eu>

sjaeckel force-pushed the sha-stack branch 3 times, most recently from 4a5791a to 3cdf96b on April 9, 2026 06:59

sjaeckel approved these changes Apr 9, 2026

sjaeckel added 3 commits April 9, 2026 15:40

Add option LTC_SMALL_STACK.

77afa82

Signed-off-by: Steffen Jaeckel <s@jaeckel.eu>

Basic algo self-tests should be run before the ones using them.

1192988

Signed-off-by: Steffen Jaeckel <s@jaeckel.eu>

Update docs.

cc53195

Signed-off-by: Steffen Jaeckel <s@jaeckel.eu>

sjaeckel force-pushed the sha-stack branch from bdd0b0e to cc53195 on April 9, 2026 13:40

sjaeckel merged commit 3223b87 into libtom:develop Apr 9, 2026
246 checks passed

MarekKnapek deleted the sha-stack branch April 9, 2026 18:50

MarekKnapek pushed a commit to MarekKnapek/libtomcrypt that referenced this pull request Apr 13, 2026

Merge pull request libtom#709 from MarekKnapek/sha-stack

0c182a0

Smaller stack usage for SHA-1, SHA-256 and SHA-512.

sjaeckel added this to the next milestone Apr 15, 2026

sjaeckel merged 5 commits into libtom:develop from MarekKnapek:sha-stack
