Smaller stack usage for SHA-1, SHA-256 and SHA-512. #709
sjaeckel merged 5 commits into libtom:develop from …
Conversation
sjaeckel
left a comment
That looks interesting. Thanks for the next PR :)
When looking at it, it seems like we'd be trading stack space for computation time, meaning that execution should be slower after the patch is applied.
So I modified the timing demo a bit to show something relevant, and the before looks as follows:
sha512 : Process at 39
sha512-256 : Process at 39
sha384 : Process at 39
sha512-224 : Process at 39
sha1 : Process at 61
sha256 : Process at 122
sha224 : Process at 122
vs. after this patch applied:
sha512 : Process at 39
sha384 : Process at 40
sha512-256 : Process at 40
sha512-224 : Process at 40
sha1 : Process at 68
sha224 : Process at 106
sha256 : Process at 106
sha1 really got worse, the sha512-based ones stayed more or less the same (maybe a little bit slower), but the sha256-based ones got significantly better performance!?
Not sure what to do with sha1, maybe enable this patch via a new LTC_SMALL_STACK option?
The other two I'd simply take unconditionally.
What do you think?
I think my next PR will be about x86 (and amd64) specific intrinsics, making SHA-1, SHA-256 and SHA-512 much, much faster.
That's the timing demo in …
That depends on how you build the library. I usually simply run … If you use CMake (and build in a folder inside the ltc folder) it'd be …
Those previous tests were done with the standard config. With …
Before the patch: …
After the patch: …
So it seems like your patch improves the performance in the default case (…). In the case … FYI: …
OK, that sounds nice. You're also thinking about adding …
My performance measurements are different. Maybe it depends on processor cache size, branch prediction buffer size and many other things.
Before: …
After: …
My command line was: …
Another measurement, this time with …
Before: …
After: …
Here I was able to improve SHA-1 … but it is still slower than …
For sure, since you most likely have a different CPU. But the differences between the algorithm classes themselves are comparable, and my statement from above:
…
is thereby validated.
Absolutely.
FYI: lower value = faster; the number shown is the number of CPU cycles per iteration -> i.e. by changing it from 110 to 121 you made it 10% slower :-D
No need to run all these, especially not …
To speed your local development cycle up I'd suggest you run …
And I run …
* Add the option to only run for a subset of algos.
* Improve `hash` to show something meaningful.

Signed-off-by: Steffen Jaeckel <s@jaeckel.eu>
4a5791a to 3cdf96b
LGTM.
@MarekKnapek you're also fine with this being merged?
Oh, this slipped my mind. I thought this change made things worse despite using fewer stack-based variables. Possibly it trades cache footprint against CPU speed and branch prediction behavior; which wins depends on the CPU model and on the data and CPU usage of the application using this library. I don't really know. If you like it, take it; if not, then don't. Also, increasing the number of configuration options might not be desirable, since it increases the testing, validation and correctness effort. In the future I might also add x86 AES, x86 SHA-1/SHA-256 and x86 SHA-512 specific implementations. That would be even better than this. (Already done for my own project.)
Signed-off-by: Steffen Jaeckel <s@jaeckel.eu>
Ah nope, this PR made things better! Thanks for that comment btw., it made me check the numbers again, and it pointed to a minor performance regression in SHA-256, which led to the following patch. It seems like it makes a difference whether the loop is broken up or executed continuously ... The following table contains the timing numbers on my machine for the four config options, develop branch vs. this PR, using gcc 15.2.1: …
The numbers using clang 21.1.8 differ quite a bit... some numbers are surprising, but OK ...
All in all this is a good improvement I'd say, since it improved the performance of SHA256 in the default configuration for both compilers tested.
That'd be awesome :) Looking forward to it!
As already mentioned, #557 added AES-NI, and #714 shows another good example of how such an integration could look! Thanks again for this PR and I'm looking forward to your next one :)
Smaller stack usage for SHA-1, SHA-256 and SHA-512.
Checklist