Smaller stack usage for SHA-1, SHA-256 and SHA-512. #709

Merged
sjaeckel merged 5 commits into libtom:develop from MarekKnapek:sha-stack
Apr 9, 2026

@MarekKnapek
Contributor

@MarekKnapek MarekKnapek commented Nov 29, 2025

Checklist

  • documentation is added or updated
  • tests are added or updated

Member

@sjaeckel sjaeckel left a comment

That looks interesting. Thanks for the next PR :)

Looking at it, it seems we'd be trading space for time, meaning that execution should be slower after the patch is applied.
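
To illustrate the space-versus-time trade-off: a hash like SHA-256 can either expand its whole 64-word message schedule up front, or keep only a 16-word ring buffer and compute each word when it is needed. The following is a minimal sketch, assuming that this is roughly the idea behind the patch. All function names are hypothetical and not taken from libtomcrypt.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rotate a 32-bit word right by n bits (0 < n < 32). */
static uint32_t rotr32(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }

/* SHA-256 message-schedule functions as defined in FIPS 180-4. */
static uint32_t sigma0(uint32_t x) { return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3); }
static uint32_t sigma1(uint32_t x) { return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10); }

/* Variant A: expand all 64 schedule words up front (256 bytes of storage). */
void schedule_full(const uint32_t block[16], uint32_t W[64])
{
    int i;
    memcpy(W, block, 16 * sizeof(uint32_t));
    for (i = 16; i < 64; ++i) {
        W[i] = sigma1(W[i - 2]) + W[i - 7] + sigma0(W[i - 15]) + W[i - 16];
    }
}

/* Variant B: keep only 16 words (64 bytes) and compute word i on demand.
 * Must be called with i = 0, 1, ..., 63 in ascending order. */
uint32_t schedule_next(uint32_t W[16], int i)
{
    if (i >= 16) {
        /* W[i & 15] still holds word i-16, so adding the other terms yields word i. */
        W[i & 15] += sigma1(W[(i - 2) & 15]) + W[(i - 7) & 15] + sigma0(W[(i - 15) & 15]);
    }
    return W[i & 15];
}

/* Hypothetical check: fill a block from a simple LCG seeded with `seed`
 * and return 1 if both variants produce the same 64 schedule words. */
int schedules_match(uint32_t seed)
{
    uint32_t block[16], full[64], ring[16];
    int i;
    for (i = 0; i < 16; ++i) {
        seed = seed * 1664525u + 1013904223u;
        block[i] = seed;
    }
    schedule_full(block, full);
    memcpy(ring, block, sizeof(ring));
    for (i = 0; i < 64; ++i) {
        if (schedule_next(ring, i) != full[i]) {
            return 0;
        }
    }
    return 1;
}
```

Variant B saves 192 bytes of stack but adds masking and a branch to every step, which is where a slowdown could come from.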

So I modified the timing demo a bit to show something relevant, and the before looks as follows:

sha512              : Process at    39
sha512-256          : Process at    39
sha384              : Process at    39
sha512-224          : Process at    39
sha1                : Process at    61
sha256              : Process at   122
sha224              : Process at   122

vs. after applying this patch:

sha512              : Process at    39
sha384              : Process at    40
sha512-256          : Process at    40
sha512-224          : Process at    40
sha1                : Process at    68
sha224              : Process at   106
sha256              : Process at   106

sha1 really got worse, sha512-based stayed more or less the same (maybe a little bit slower), but sha256-based got significantly better performance!?

Not sure what to do with sha1, maybe enable this patch via a new LTC_SMALL_STACK option?
The other two I'd simply take unconditionally.

What do you think?

@MarekKnapek
Contributor Author

I think my next PR will be about x86 (and amd64) specific intrinsics, making SHA-1, SHA-256 and SHA-512 much, much faster.

  • How do I run these benchmarks myself? I could play with the code a bit more, maybe adding an if(i >= 16){ Wi(i) } condition into the loops.
  • Also what configuration option did you use? I mean with LTC_SMALL_CODE or without?
  • And how do I enable/disable this option at compile time? So far I manually edited some global header file to enable/disable this option. But I believe there might be some more kosher way of toggling this option.

@sjaeckel
Member

sjaeckel commented Dec 2, 2025

> How do I run these benchmarks myself? I could play with the code a bit more, maybe adding an if(i >= 16){ Wi(i) } condition into the loops.

That's the timing demo in demos/timing.c. I've just pushed an update to it and ran it as ./timing hash sha, then manually removed the sha3 parts when pasting its output here.

> And how do I enable/disable this option at compile time?

That depends on how you build the library.

I usually simply run make, so it's a matter of make -j$(($(nproc)*2+1)) timing CFLAGS="-DLTC_SMALL_CODE".

If you use CMake (and build in a folder inside the ltc folder) it'd be cmake -DLTC_CFLAGS="-DLTC_SMALL_CODE" -DCMAKE_BUILD_TYPE=Release -DBUILD_USABLE_DEMOS=On .. && make -j$(($(nproc)*2+1)), then run ./demos/ltc-timing hash sha.

> Also what configuration option did you use? I mean with LTC_SMALL_CODE or without?

Those previous tests were done with the standard config. With LTC_SMALL_CODE enabled they look like this:

Before the patch:

sha512              : Process at    39
sha384              : Process at    39
sha512-256          : Process at    40
sha512-224          : Process at    40
sha1                : Process at    84
sha256              : Process at   108
sha224              : Process at   108

After the patch:

sha512-224          : Process at    45
sha512              : Process at    45
sha512-256          : Process at    45
sha384              : Process at    45
sha1                : Process at    77
sha256              : Process at   132
sha224              : Process at   134

So it seems like your patch improves the performance in the default case (LTC_SMALL_CODE undefined) for "sha256 based", but deteriorates for "sha1".

In the case LTC_SMALL_CODE is defined it improves "sha1", but deteriorates the two others.

FYI:

$ head /proc/cpuinfo | grep 'model name'
model name      : AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics

> I think my next PR will be about x86 (and amd64) specific intrinsics

OK, that sounds nice. Are you also thinking about adding SHA-NI support? If so, have a look at #557 to see how we did it for AES-NI.

@MarekKnapek
Contributor Author

My performance measurements are different. Maybe it depends on processor cache size, branch prediction buffer size and many other things.

Before:

sha1                : Process at    98
sha224              : Process at   188
sha256              : Process at   188
sha384              : Process at    58
sha512              : Process at    58
sha512-224          : Process at    58
sha512-256          : Process at    58

After:

sha1                : Process at   100
sha224              : Process at   185
sha256              : Process at   185
sha384              : Process at    67
sha512              : Process at    67
sha512-224          : Process at    67
sha512-256          : Process at    67

My command line was:
make && make test && make docs && make timing && ./test && ./helper.pl -a && ./timing hash sha

Another measurement, this time with LTC_SMALL_CODE.

Before:

sha1                : Process at   135
sha224              : Process at   190
sha256              : Process at   190
sha384              : Process at    62
sha512              : Process at    62
sha512-224          : Process at    62
sha512-256          : Process at    62

After:

sha1                : Process at   110
sha224              : Process at   229
sha256              : Process at   229
sha384              : Process at    73
sha512              : Process at    73
sha512-224          : Process at    73
sha512-256          : Process at    73

Here I was able to improve SHA-1 after speed from 110 to 121 by changing the first loop to:

    for (i = 0; i < 20; ) {
       if (i >= 16) { Wi(i); }
       FF0(a,b,c,d,e,i++);
       t = e; e = d; d = c; c = b; b = a; a = t;
    }

But it is still slower than before.
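
For readers who want to experiment with the idea outside the library, here is a minimal, self-contained sketch of a SHA-1 compression function that expands the message schedule on demand inside a single loop. It is not the libtomcrypt code: the Wi and FF0 macros from the snippet above are replaced by plain expressions, and the names sha1_compress_small and sha1_abc_ok are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rotl32(uint32_t x, unsigned n) { return (x << n) | (x >> (32 - n)); }

/* Compress one 64-byte block into state[5], keeping only a 16-word ring
 * buffer for the message schedule instead of the full 80-word array. */
void sha1_compress_small(uint32_t state[5], const unsigned char block[64])
{
    uint32_t W[16], a, b, c, d, e, f, k, t;
    int i;

    /* Load the block as 16 big-endian 32-bit words. */
    for (i = 0; i < 16; ++i) {
        W[i] = ((uint32_t)block[4 * i] << 24) | ((uint32_t)block[4 * i + 1] << 16) |
               ((uint32_t)block[4 * i + 2] << 8) | (uint32_t)block[4 * i + 3];
    }
    a = state[0]; b = state[1]; c = state[2]; d = state[3]; e = state[4];

    for (i = 0; i < 80; ++i) {
        if (i >= 16) {
            /* W[i & 15] still holds word i-16; overwrite it with word i. */
            W[i & 15] = rotl32(W[(i - 3) & 15] ^ W[(i - 8) & 15] ^
                               W[(i - 14) & 15] ^ W[i & 15], 1);
        }
        if (i < 20)      { f = (b & c) | (~b & d);          k = 0x5A827999u; }
        else if (i < 40) { f = b ^ c ^ d;                   k = 0x6ED9EBA1u; }
        else if (i < 60) { f = (b & c) | (b & d) | (c & d); k = 0x8F1BBCDCu; }
        else             { f = b ^ c ^ d;                   k = 0xCA62C1D6u; }
        t = rotl32(a, 5) + f + e + k + W[i & 15];
        e = d; d = c; c = rotl32(b, 30); b = a; a = t;
    }
    state[0] += a; state[1] += b; state[2] += c; state[3] += d; state[4] += e;
}

/* Returns 1 if the single-block message "abc" hashes to the standard
 * SHA-1 test vector a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d. */
int sha1_abc_ok(void)
{
    static const uint32_t expect[5] = {
        0xa9993e36u, 0x4706816au, 0xba3e2571u, 0x7850c26cu, 0x9cd0d89du
    };
    uint32_t state[5] = {
        0x67452301u, 0xEFCDAB89u, 0x98BADCFEu, 0x10325476u, 0xC3D2E1F0u
    };
    unsigned char block[64] = {0};

    memcpy(block, "abc", 3);
    block[3] = 0x80;  /* padding: a single 1 bit after the message */
    block[63] = 24;   /* message length in bits, big-endian */
    sha1_compress_small(state, block);
    return memcmp(state, expect, sizeof(expect)) == 0;
}
```

The per-round branches keep the sketch short; a tuned implementation would more likely split the 80 rounds into four loops of 20, as the snippet above does for the first one.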

@sjaeckel
Member

sjaeckel commented Dec 3, 2025

> My performance measurements are different.

For sure, since you most likely have a different CPU. But the differences of the algorithm classes themselves are comparable and my statement from above:

> [...] your patch improves the performance in the default case (LTC_SMALL_CODE undefined) for "sha256 based", but deteriorates for "sha1".
>
> In the case LTC_SMALL_CODE is defined it improves "sha1", but deteriorates the two others.

is thereby validated.

> Maybe it depends on processor cache size, branch prediction buffer size and many other things.

Absolutely.

> Here I was able to improve SHA-1 after speed from 110 to 121 by changing the first loop to:

FYI: lower value = faster, the number shown is "the number of CPU cycles per iteration" -> i.e. by having it changed from 110 to 121 you made it 10% slower :-D
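
The arithmetic behind that remark can be sketched with a small helper. The function name is hypothetical and is not part of the timing demo.

```c
#include <assert.h>

/* Relative change in percent when the cycle count goes from `before` to
 * `after`. Positive means more cycles, i.e. slower; negative means faster.
 * `before` must be non-zero. */
double cycles_change_percent(double before, double after)
{
    return 100.0 * (after - before) / before;
}
```

With the numbers above, cycles_change_percent(110, 121) comes out at about 10, i.e. roughly 10% more cycles per iteration.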

> My command line was: make && make test && make docs && make timing && ./test && ./helper.pl -a && ./timing hash sha

No need to run all these, especially not make test since the timing demo already runs the self-tests of the hash algorithms and would exit with an error if it failed.

To speed up your local development cycle I'd suggest running make -j$(($(nproc)*2+1)) timing && ./timing hash sha.
You most likely also have a multi-core CPU, so you can use that fact and run parallel builds via the -j option.

I never run ./helper.pl -a manually. Instead, you can execute make install_hooks once; it installs a Git pre-commit hook that checks that it succeeds before committing.

MarekKnapek and others added 2 commits April 9, 2026 08:29
* Add the option to only run for a subset of algos.
* Improve `hash` to show something meaningful.

Signed-off-by: Steffen Jaeckel <s@jaeckel.eu>
@sjaeckel sjaeckel force-pushed the sha-stack branch 3 times, most recently from 4a5791a to 3cdf96b April 9, 2026 06:59
Member

@sjaeckel sjaeckel left a comment

LGTM.

@MarekKnapek are you also fine with it being merged?

@MarekKnapek
Contributor Author

Oh, this slipped my mind. I thought this change made it worse despite using fewer stack based variables. It possibly trades cache speed or cache usage against CPU speed and CPU prediction logic; which is better depends on the CPU model and on the data and CPU usage of the app using this library.

I don't really know, if you like it, take it. If not, then don't.

Also, increasing the number of configuration options might not be desirable, since it increases the testing, validation and correctness effort.

In the future I might also add x86 AES, x86 SHA-1/SHA-256 and x86 SHA-512 specific implementations. That would be even better than this. (Already done for my own project.)

sjaeckel added 3 commits April 9, 2026 15:40
Signed-off-by: Steffen Jaeckel <s@jaeckel.eu>
Signed-off-by: Steffen Jaeckel <s@jaeckel.eu>
Signed-off-by: Steffen Jaeckel <s@jaeckel.eu>
@sjaeckel
Member

sjaeckel commented Apr 9, 2026

> I thought this change made it worse despite using fewer stack based variables.

Ah nope, this PR made things better!

Thanks for that comment, by the way. It made me check the numbers again and pointed to a minor performance regression in SHA256, which led to the following patch.

diff --git a/src/hashes/sha2/sha256.c b/src/hashes/sha2/sha256.c
index 70c132afe675..ffd8d6f8b623 100644
--- a/src/hashes/sha2/sha256.c
+++ b/src/hashes/sha2/sha256.c
@@ -116,2 +116,3 @@ static int s_sha256_compress(hash_state * md, const unsigned char *buf)
 
+#ifdef LTC_SMALL_STACK_SHA256
      for (i = 0; i < 16; ++i) {
@@ -127,2 +128,9 @@ static int s_sha256_compress(hash_state * md, const unsigned char *buf)
      }
+#else
+     for (i = 0; i < 64; ++i) {
+         RND(S[0],S[1],S[2],S[3],S[4],S[5],S[6],S[7],i);
+         t = S[7]; S[7] = S[6]; S[6] = S[5]; S[5] = S[4];
+         S[4] = S[3]; S[3] = S[2]; S[2] = S[1]; S[1] = S[0]; S[0] = t;
+     }
+#endif /* LTC_SMALL_STACK_SHA256 */
 #else

It seems to make a difference whether the loop is split up or executed continuously ...
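
The continuous loop in the #else branch of the patch runs one round and then rotates the eight working variables. A minimal sketch of that pattern follows, checked against a variant that permutes the argument order instead of moving data. The macros are modeled on SHA-256 but simplified: kw is a stand-in for K[i] + W[i], and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ROR(x, n)    (((x) >> (n)) | ((x) << (32 - (n))))
#define Ch(x, y, z)  ((z) ^ ((x) & ((y) ^ (z))))
#define Maj(x, y, z) ((((x) | (y)) & (z)) | ((x) & (y)))
#define Sigma0(x)    (ROR(x, 2) ^ ROR(x, 13) ^ ROR(x, 22))
#define Sigma1(x)    (ROR(x, 6) ^ ROR(x, 11) ^ ROR(x, 25))

/* One SHA-256-style round; kw stands in for K[i] + W[i]. */
#define RND(a, b, c, d, e, f, g, h, kw)                        \
    do {                                                       \
        uint32_t t0 = (h) + Sigma1(e) + Ch(e, f, g) + (kw);    \
        uint32_t t1 = Sigma0(a) + Maj(a, b, c);                \
        (d) += t0;                                             \
        (h)  = t0 + t1;                                        \
    } while (0)

/* One continuous loop; the state is rotated by one position after each round. */
void rounds_rotating(uint32_t S[8], const uint32_t kw[64])
{
    uint32_t t;
    int i;
    for (i = 0; i < 64; ++i) {
        RND(S[0], S[1], S[2], S[3], S[4], S[5], S[6], S[7], kw[i]);
        t = S[7]; S[7] = S[6]; S[6] = S[5]; S[5] = S[4];
        S[4] = S[3]; S[3] = S[2]; S[2] = S[1]; S[1] = S[0]; S[0] = t;
    }
}

/* Eight rounds per iteration; the argument order is permuted, no data moves. */
void rounds_permuted(uint32_t S[8], const uint32_t kw[64])
{
    int i;
    for (i = 0; i < 64; i += 8) {
        RND(S[0], S[1], S[2], S[3], S[4], S[5], S[6], S[7], kw[i + 0]);
        RND(S[7], S[0], S[1], S[2], S[3], S[4], S[5], S[6], kw[i + 1]);
        RND(S[6], S[7], S[0], S[1], S[2], S[3], S[4], S[5], kw[i + 2]);
        RND(S[5], S[6], S[7], S[0], S[1], S[2], S[3], S[4], kw[i + 3]);
        RND(S[4], S[5], S[6], S[7], S[0], S[1], S[2], S[3], kw[i + 4]);
        RND(S[3], S[4], S[5], S[6], S[7], S[0], S[1], S[2], kw[i + 5]);
        RND(S[2], S[3], S[4], S[5], S[6], S[7], S[0], S[1], kw[i + 6]);
        RND(S[1], S[2], S[3], S[4], S[5], S[6], S[7], S[0], kw[i + 7]);
    }
}

/* Hypothetical check: run both variants on LCG-generated inputs and
 * return 1 if the resulting states are identical. */
int variants_agree(uint32_t seed)
{
    uint32_t A[8], B[8], kw[64];
    int i;
    for (i = 0; i < 8; ++i)  { seed = seed * 1664525u + 1013904223u; A[i] = seed; }
    for (i = 0; i < 64; ++i) { seed = seed * 1664525u + 1013904223u; kw[i] = seed; }
    memcpy(B, A, sizeof(B));
    rounds_rotating(A, kw);
    rounds_permuted(B, kw);
    return memcmp(A, B, sizeof(A)) == 0;
}
```

Both variants compute the same result because eight single-position rotations bring the state back to its original layout. Which one is faster depends on the compiler and CPU, as the numbers in this thread show.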

The following table contains the timing numbers on my machine of the four config options of the develop branch vs. this PR using gcc 15.2.1.

| Config | Algorithm | develop (Process at) | pr709 (Process at) |
|---|---|---|---|
| default | sha512-based | 40 | 41 |
| | sha1 | 63 | 63 |
| | sha256-based | 127 | 111 |
| SMALL_CODE | sha512-based | 42 | 39 |
| | sha1 | 89 | 77 |
| | sha256-based | 113 | 113 |
| SMALL_STACK | sha512-based | | 42 |
| | sha1 | | 72 |
| | sha256-based | | 111 |
| SMALL_STACK + SMALL_CODE | sha512-based | | 47 |
| | sha1 | | 78 |
| | sha256-based | | 138 |

The numbers using clang 21.1.8 differ quite a bit... some numbers are surprising, but OK ...

| Config | Algorithm | develop (Process at) | pr709 (Process at) |
|---|---|---|---|
| default | sha512-based | 41 | 43 |
| | sha1 | 91 | 91 |
| | sha256-based | 143 | 106 |
| SMALL_CODE | sha512-based | 46 | 46 |
| | sha1 | 91 | 83 |
| | sha256-based | 143 | 144 |
| SMALL_STACK | sha512-based | | 38 |
| | sha1 | | 64 |
| | sha256-based | | 107 |
| SMALL_STACK + SMALL_CODE | sha512-based | | 44 |
| | sha1 | | 84 |
| | sha256-based | | 140 |

All in all this is a good improvement I'd say, since it improved the performance of SHA256 in the default configuration for both compilers tested.

> In the future I might also add x86 AES, x86 SHA-1/SHA-256 and x86 SHA-512 specific implementations. That would be even better than this. (Already done for my own project.)

That'd be awesome :) Looking forward to it!

@sjaeckel sjaeckel merged commit 3223b87 into libtom:develop Apr 9, 2026
246 checks passed
@sjaeckel
Member

sjaeckel commented Apr 9, 2026

> In the future I might also add x86 AES, x86 SHA-1/SHA-256 and x86 SHA-512 specific implementations. That would be even better than this. (Already done for my own project.)

As already mentioned, #557 added AES-NI, and #714 shows another good example of what such an integration could look like!

Thanks again for this PR and I'm looking forward to your next one :)

@MarekKnapek MarekKnapek deleted the sha-stack branch April 9, 2026 18:50
MarekKnapek pushed a commit to MarekKnapek/libtomcrypt that referenced this pull request Apr 13, 2026
Smaller stack usage for SHA-1, SHA-256 and SHA-512.