Feature/issue 3311 test thread tbb exp#3314
Conversation
parallel_for, blocked range compiles for stan::math::exp compiling blocked_range works fine some progress, now a type deduction issue? ok something closer... implement struct version for parallel_for... uncompiled begin new class to use parallel for almost compiles... getting close, have template deduction failed which we can figure out almost compiles hold on compiles remove dead code compiled parallel_for, blocked_range for stan::math::exp compiled parallel_for, blocked_range for stan::math::exp
|
Hold on, sorry I should re-base. I have some questions, wondering if anyone had comments or is this all on me? Refactor, and using threads at lower number of observations. |
…rezap/math into feature/issue-3311-test-thread-tbb-exp
|
Do you have a graph that shows the speedup? Overall I'd be kind of cautious introducing lower level threading like this. Like you saw, whether you get a speedup or slowdown depends a lot on the number of observations. So for every vector operation we would have to have a check that the size exceeded some threshold. That threshold is going to vary a lot per computer, and I think if we are not careful it could make the codebase kind of funky. The other piece here is that this works for |
|
I’m thinking about, haven’t thought too far ahead yet, thank you.
Best,
Andre Zapico
linkedin.com/in/andre-zapico
gitub.com/drezap
ME Information and Communication Engineering
University of Electronic Science and Technology of China
Consultant, Owner
likely llc
likelyllc.com
Stan Developer
mc-stan.org
BS Mathematical Sciences: Probabilistic Methods
BS Statistics
University of Michigan, Ann Arbor 2017
…On Thu, Apr 30, 2026 at 5:46 PM Steve Bronder ***@***.***> wrote:
*SteveBronder* left a comment (stan-dev/math#3314)
<#3314 (comment)>
Do you have a graph that shows the speedup? Overall I'd be kind of
cautious introducing lower level threading like this. Like you saw, whether
you get a speedup or slowdown depends a lot on the number of observations.
So for every vector operation we would have to have a check that the size
exceeded some threshold. That threshold is going to vary a lot per computer
and I think if we are not careful it could make the codebase kind of
funky.
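A minimal sketch of the size check described above, assuming plain `std::thread` in place of TBB's `parallel_for` so the example needs only the standard library. The names `kExampleThreshold`, `exp_chunk`, and `exp_with_threshold` are hypothetical, and the cutoff value is a placeholder rather than a measured number.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical cutoff: inputs smaller than this stay single threaded.
// The right value is machine dependent and would have to be measured.
constexpr std::size_t kExampleThreshold = 100000;

// Serial elementwise exp over the half-open index range [begin, end).
inline void exp_chunk(const std::vector<double>& x, std::vector<double>& out,
                      std::size_t begin, std::size_t end) {
  for (std::size_t i = begin; i < end; ++i) {
    out[i] = std::exp(x[i]);
  }
}

// Elementwise exp that only spawns threads when the input is large enough.
inline std::vector<double> exp_with_threshold(
    const std::vector<double>& x, std::size_t threshold = kExampleThreshold,
    unsigned n_threads = 4) {
  assert(n_threads >= 1);
  std::vector<double> out(x.size());
  if (x.size() < threshold || n_threads == 1) {
    exp_chunk(x, out, 0, x.size());  // small problem: skip threading overhead
    return out;
  }
  // Split the index range into contiguous chunks, one per worker thread.
  const std::size_t chunk = (x.size() + n_threads - 1) / n_threads;
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < n_threads; ++t) {
    const std::size_t begin = std::min(x.size(), t * chunk);
    const std::size_t end = std::min(x.size(), begin + chunk);
    if (begin == end) {
      break;  // nothing left to hand out
    }
    workers.emplace_back(
        [&x, &out, begin, end] { exp_chunk(x, out, begin, end); });
  }
  for (auto& w : workers) {
    w.join();
  }
  return out;
}
```

Every threaded function would need a branch like the one on `threshold`, which is the maintenance cost being pointed at here.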
The other piece here is that this works for prim functions of double
type, but parallelism is much harder for reverse mode which is the main
piece of the math library we worry about. The main issue is handling how
the global AD tape should sync when we have jobs across N threads.
@andrjohns <https://github.com/andrjohns> thought for a long while trying
to figure out how to do a nice parallel map(...) style function for
reverse mode autodiff. I'm not sure he came up with something he found
satisfying. I have not either honestly. Essentially you need to shard the
operation over N shards which will have N autodiff stacks, then once the
parallel computation is done you have to pass those autodiff stacks back
and put them onto the main thread's stack. So there you would get
performance benefits for setting up the forward pass in parallel, but then
the reverse pass would still be serial and you pay the cost of the sharding
and thread startup. I'm very certain there is a way to do it so you can do
the forward and reverse pass in parallel, but nothing has ever come to me
for this problem.
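The sharding scheme described above can be sketched with a toy tape. This is not Stan Math's autodiff stack: `ToyTapeEntry`, `ToyGradResult`, and `sum_exp_sharded` are hypothetical names, the function differentiated is just f(x) = sum of exp(x_i), and `std::thread` stands in for TBB. It only illustrates the shape of the idea: the forward pass runs in parallel on per-shard tapes, the tapes are appended to one main tape, and the reverse pass stays serial.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Toy tape entry for y = exp(x[input]): stores the partial dy/dx.
struct ToyTapeEntry {
  std::size_t input;
  double partial;
};

struct ToyGradResult {
  double value;              // f(x) = sum_i exp(x_i)
  std::vector<double> grad;  // df/dx_i
};

inline ToyGradResult sum_exp_sharded(const std::vector<double>& x,
                                     unsigned n_shards) {
  assert(n_shards >= 1);
  std::vector<std::vector<ToyTapeEntry>> shard_tapes(n_shards);
  std::vector<double> shard_sums(n_shards, 0.0);
  const std::size_t chunk = (x.size() + n_shards - 1) / n_shards;

  // Forward pass: each shard records onto its own tape, in parallel.
  std::vector<std::thread> workers;
  for (unsigned s = 0; s < n_shards; ++s) {
    const std::size_t begin = std::min(x.size(), s * chunk);
    const std::size_t end = std::min(x.size(), begin + chunk);
    workers.emplace_back([&x, &shard_tapes, &shard_sums, s, begin, end] {
      for (std::size_t i = begin; i < end; ++i) {
        const double v = std::exp(x[i]);
        shard_sums[s] += v;
        shard_tapes[s].push_back({i, v});  // d exp(x)/dx = exp(x)
      }
    });
  }
  for (auto& w : workers) {
    w.join();
  }

  // Merge: append every shard tape onto the main tape (serial).
  std::vector<ToyTapeEntry> main_tape;
  ToyGradResult result{0.0, std::vector<double>(x.size(), 0.0)};
  for (unsigned s = 0; s < n_shards; ++s) {
    result.value += shard_sums[s];
    main_tape.insert(main_tape.end(), shard_tapes[s].begin(),
                     shard_tapes[s].end());
  }

  // Reverse pass: still serial, walking the main tape backwards.
  const double upstream = 1.0;  // adjoint of the summed output
  for (auto it = main_tape.rbegin(); it != main_tape.rend(); ++it) {
    result.grad[it->input] += upstream * it->partial;
  }
  return result;
}
```

In this toy version only the first loop benefits from extra threads; the merge and the reverse sweep are the serial costs mentioned in the comment.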
|
|
I'm running the continuous integration tests, and it looks like they're mostly passing now.
And I need to consider threading the rev autodiff stack. That would be cool, if different threads could build different expression trees, which I think is what Steve was saying. But if this adds an incremental speed increase, why not? WRT Steve's comment, I can think about it, but here I'm not parallelizing anything on the stack, just evaluation of the computation of |
Jenkins Console Log. Machine information: No LSB modules are available. Distributor ID: Ubuntu. Description: Ubuntu 20.04.3 LTS. Release: 20.04. Codename: focal. CPU: G++: Clang: |
|
Not sure why Jenkins emailed me SUCCESS when there are so many errors. I'm not seeing these locally. I also named the branch wrong, but I'll just leave it until it's closed... |
Jenkins Console Log. Machine information: No LSB modules are available. Distributor ID: Ubuntu. Description: Ubuntu 20.04.3 LTS. Release: 20.04. Codename: focal. CPU: G++: Clang: |
|
It's just some preprocessor directives, which are essentially if statements that
determine whether a certain area of code will be compiled or not.
I'm only skimming this; I'm waiting on a bootloader for a free Mac I got.
I'm not seeing any direct comparisons between threaded and non-threaded
code, and there seems to be a discrepancy between concurrency and running a
process on different cores.
I'm with Bob:
#1918 (comment)
Instead of chatting, let's come up with a concrete way of determining
whether something is faster.
WRT maintenance, it's like 3 lines of code and some directives. Easy to
maintain.
I seem to have accidentally discovered Amdahl's law. So I propose we come
up with concrete benchmark objectives, and if it's faster we proceed.
Also, typing matters.
And I'm not sure how reliable the posteriorDB estimates are, but
locally in Stan/math, parallelization with tbb was faster within limits
(#threads matters, etc.). If we're running this on exp with many evaluations of
a Gaussian distribution, for example, for thousands of iterations, this could
be worth it. But to play devil's advocate, recollecting threads could
also slow it down.
Again, I'm handicapped: no computer.
But in summary, I don't think the linked thread effectively evaluates
whether this is faster or not. All HPC devs use threading, no? Any ringers
we can bring in?
But wrt maintenance, easy.
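For readers unfamiliar with the compile-time switches described at the top of this comment, here is a rough sketch of the pattern. The macro `EXAMPLE_USE_THREADS` and the functions `cc_threads_enabled` and `cc_exp` are hypothetical names for illustration, not the ones used in the pull request, and `std::thread` stands in for TBB.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// EXAMPLE_USE_THREADS is a hypothetical macro for this sketch. It would be
// set at build time, for example with -DEXAMPLE_USE_THREADS.
#ifdef EXAMPLE_USE_THREADS
#include <thread>
#endif

// Reports which branch was compiled in.
inline bool cc_threads_enabled() {
#ifdef EXAMPLE_USE_THREADS
  return true;
#else
  return false;
#endif
}

// Elementwise exp; the threaded body only exists when the macro is defined.
inline std::vector<double> cc_exp(const std::vector<double>& x) {
  std::vector<double> out(x.size());
#ifdef EXAMPLE_USE_THREADS
  // Threaded build: one helper thread takes the lower half of the range.
  const std::size_t mid = x.size() / 2;
  std::thread lower([&x, &out, mid] {
    for (std::size_t i = 0; i < mid; ++i) {
      out[i] = std::exp(x[i]);
    }
  });
  for (std::size_t i = mid; i < x.size(); ++i) {
    out[i] = std::exp(x[i]);
  }
  lower.join();
#else
  // Default build: plain serial loop, no threading code is compiled at all.
  for (std::size_t i = 0; i < x.size(); ++i) {
    out[i] = std::exp(x[i]);
  }
#endif
  return out;
}
```

Both branches return the same values, so existing tests would pass with either build; only the timing differs.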
…On Wed, May 27, 2026, 1:51 PM Steve Bronder ***@***.***> wrote:
*SteveBronder* left a comment (stan-dev/math#3314)
<#3314 (comment)>
I'm reposting my reply from here
<#3314 (comment)>
The main issue is, when does the overhead of threading make it worth
running in parallel? i.e. if a user has a vector of 100 elements, how many
threads should be used for exp(x)? None? Two? Threading has a decently high
overhead cost and for each instantiation of threading you pay for that. So
for small problems the answer is most likely single threaded. When we are
trying to add automatic parallelization we will need to ask at runtime "how
many threads should this operation use given the data size?" That requires
understanding a lot of information about how fast a user's particular
machine can calculate a function and is a pretty hard problem. Eigen does
this, but only for matrix multiplication as they have very good runtime
logic to detect if sharding a large matrix multiply across threads is worth
it.
The logic for deciding at runtime whether a particular function is worth
moving over to the gpu is going to be a lot of developer and runtime
overhead. imo I think the maintenance would not be worth it.
This has been attempted previously by @andrjohns
<https://github.com/andrjohns> (and I took a crack at it myself). You can
see that whole conversation here
<#1918>
|
|
And again, in andrjohns's thread I'm not seeing any direct comparisons
between threaded and unthreaded code, i.e. there's no control and treatment
group. We can't just guess. We need to systematically evaluate it.
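One way a control and treatment comparison could look, as a rough sketch: time the same elementwise `exp` serially and threaded on identical input and report both. All names here (`bench_exp_serial`, `bench_exp_threaded`, `median_seconds`, `BenchResult`, `compare_exp`) are hypothetical, and `std::thread` stands in for TBB so the harness needs only the standard library.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Control: single threaded elementwise exp.
inline void bench_exp_serial(const std::vector<double>& x,
                             std::vector<double>& out) {
  for (std::size_t i = 0; i < x.size(); ++i) {
    out[i] = std::exp(x[i]);
  }
}

// Treatment: the same work split into contiguous chunks across threads.
inline void bench_exp_threaded(const std::vector<double>& x,
                               std::vector<double>& out, unsigned n_threads) {
  assert(n_threads >= 1);
  const std::size_t chunk = (x.size() + n_threads - 1) / n_threads;
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < n_threads; ++t) {
    const std::size_t begin = std::min(x.size(), t * chunk);
    const std::size_t end = std::min(x.size(), begin + chunk);
    workers.emplace_back([&x, &out, begin, end] {
      for (std::size_t i = begin; i < end; ++i) {
        out[i] = std::exp(x[i]);
      }
    });
  }
  for (auto& w : workers) {
    w.join();
  }
}

// Median wall-clock time in seconds over `reps` calls of f.
template <typename F>
double median_seconds(F&& f, int reps) {
  assert(reps >= 1);
  std::vector<double> times;
  for (int r = 0; r < reps; ++r) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    times.push_back(std::chrono::duration<double>(stop - start).count());
  }
  std::sort(times.begin(), times.end());
  return times[times.size() / 2];
}

struct BenchResult {
  std::size_t n;
  double serial_s;
  double threaded_s;
  bool outputs_match;  // guards against timing a wrong answer
};

// Runs control and treatment on the same input of length n.
inline BenchResult compare_exp(std::size_t n, unsigned n_threads, int reps) {
  std::vector<double> x(n);
  for (std::size_t i = 0; i < n; ++i) {
    x[i] = 1e-6 * static_cast<double>(i);
  }
  std::vector<double> a(n), b(n);
  BenchResult r;
  r.n = n;
  r.serial_s = median_seconds([&] { bench_exp_serial(x, a); }, reps);
  r.threaded_s =
      median_seconds([&] { bench_exp_threaded(x, b, n_threads); }, reps);
  r.outputs_match = (a == b);
  return r;
}
```

Calling `compare_exp` over a grid of sizes and thread counts would give the crossover point, if any, on a given machine; no timings are claimed here.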
|
|
And I answered these questions with my benchmarks, so it's not a big
mystery: Amdahl's law seems to apply.
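For reference, Amdahl's law in its usual form, where p is the fraction of the runtime that can be parallelized and N is the number of threads. The second line is an assumption added for illustration, not part of the law: a hypothetical per-call overhead c(N) for starting and joining threads, which is one way to account for the slowdown seen at small sizes.

```latex
% Amdahl's law: speedup with N threads when a fraction p is parallelizable
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}

% Illustrative extension (assumption): T_1 is the serial runtime and
% c(N) is the threading overhead per call
T(N) = (1 - p)\,T_1 + \frac{p\,T_1}{N} + c(N)
```

Under that assumption, threading only pays off when the time saved on the parallel part exceeds c(N), which is consistent with a size threshold.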
|
|
What I am suggesting is that we ignore MCMC for now and just look at the runtime of
evaluating probability distributions. Pretty much all of them use an exponential.
So if there's a slight gain on evaluating computations then it's totally
worth it to add, no? I think a lot of developers do this under the hood but
don't expose it to users. Do the threads navigate through composite
functions (e.g. the normal distribution)? No idea. But the benchmarks I ran seemed
to show improved performance, if we're not considering autodiff, and the code passed
the tests. I am going for performance, not fancy publications, if that makes
sense. But I'm sure devs do this under the hood for game dev etc. The code
I added was only a few lines, and some directives.
|
|
Here, I found this informative.
***@***.***/parallel-reduction-in-cuda-bba5e3d124b9
|
|
Ok, I am reading through the old threading discussion a bit more
thoroughly. It's cool, but it has many degrees of freedom, and it would be
better to define specifically what we're trying to thread. Something as
simple as concurrency in an operation that requires a lot of FLOPs could
potentially add some speed. And then isolate autodiff later? The
conversation is going in a bunch of different directions, and it's not
concrete as to what we're trying to do. But systematically threading simple
functions and value evaluations for PDFs might be a starting point, if that
adds speed. Threading autodiff is a different problem, but starting simple
on an iterative algorithm could add cumulative gains.
See what I'm saying? so there's a concrete gain as opposed to a convoluted
research question?
So OK: thread this, benchmark on all PDFs, and then continue. Just
evaluation, not gradients; then we could mess with autodiff more.
I'm not sure what proportion of Stan's HMC is purely evaluation, but I
think it's non-negligible and could be sped up.
And then focus on autodiff after. Sound stupid?
Best,
Andre Zapico
…On Wed, May 27, 2026, 9:04 PM Andre Zapico ***@***.***> wrote:
Here, I found this informative.
***@***.***/parallel-reduction-in-cuda-bba5e3d124b9
Best,
Andre Zapico
On Wed, May 27, 2026, 8:07 PM Andre Zapico ***@***.***> wrote:
> What I am suggesting is we ignore MCMC for now, and just go with runtime
> at evaluating prob distributions. Pretty much all of them use an
> exponential. So if there's even a slight gain on evaluating computations,
> then it's totally worth adding, no? I think a lot of developers do this
> under the hood but don't expose it to users. Do the threads navigate
> through composite functions (e.g. the normal distribution)? No idea. But
> in the tests I ran, performance seemed to improve, if we're not
> considering autodiff, and the tests passed. I am going for performance,
> not fancy publications, if that makes sense. I'm sure devs do this under
> the hood for game dev etc. The code I added was only a few lines, and
> some declaratives.
>
> Best,
>
>
> Andre Zapico
>
> On Wed, May 27, 2026, 5:25 PM Andre Zapico ***@***.***> wrote:
>
>> And I answered these questions with my benchmarks, so it's not a big
>> mystery: Amdahl's law seems to apply.
>>
>>
>> *SteveBronder* left a comment (stan-dev/math#3314)
>> <#3314 (comment)>
>>
>> I'm reposting my reply from here
>> <#3314 (comment)>
>>
>> The main issue is, when does the overhead of threading make it worth
>> running in parallel? i.e. if a user has a vector of 100 elements, how many
>> threads should be used for exp(x)? None? Two? Threading has a decently high
>> overhead cost and for each instantiation of threading you pay for that. So
>> for small problems the answer is
>>
>>
>> Best,
>>
>>
>> Andre Zapico
>>
>> On Wed, May 27, 2026, 5:18 PM Andre Zapico ***@***.***>
>> wrote:
>>
>>> And again, in andrjohns' thread I'm not seeing any direct comparisons
>>> between threaded and unthreaded code, i.e. there's no control and
>>> treatment group. We can't just guess. We need to evaluate it systematically.
>>>
>>> Best,
>>>
>>>
>>> Andre Zapico
>>>
>>> On Wed, May 27, 2026, 5:13 PM Andre Zapico ***@***.***>
>>> wrote:
>>>
>>>> It's like some declarations, which are essentially just if statements
>>>> that determine whether a certain area of code will be compiled or not.
>>>>
>>>> I'm skimming this while I wait on a bootloader for a free Mac I got.
>>>>
>>>> I'm not seeing any direct comparisons between threaded and non
>>>> threaded code, and there seems to be a discrepancy between concurrency and
>>>> running a process on different cores.
>>>>
>>>> I'm with Bob:
>>>> #1918 (comment)
>>>>
>>>> Instead of chatting, let's come up with a concrete way of determining
>>>> whether something is faster.
>>>>
>>>> WRT maintenance, it's like 3 lines of code and some declaratives. Easy
>>>> to maintain.
>>>>
>>>> I seem to have accidentally discovered Amdahl's law. So I propose we
>>>> come up with concrete benchmark objectives, and if it's faster we
>>>> proceed. Also, typing matters.
>>>>
>>>> And I'm not sure how reliable the posteriorDB estimates are, but
>>>> locally in Stan Math, parallelization with TBB was faster within limits
>>>> (#threads matters, etc.), and running this on exp with many evaluations
>>>> of a Gaussian distribution, for example, over thousands of iterations
>>>> could be worth it. But to play devil's advocate, collecting the threads
>>>> again could also slow it down.
>>>>
>>>> Again, I'm handicapped: no computer.
>>>>
>>>> But in summary, I don't think the linked thread effectively evaluates
>>>> whether this is faster or not. All HPC devs use threading, no? Any ringers
>>>> we can bring in?
>>>>
>>>> But wrt maintenance, easy.
>>>>
>>>> Best,
>>>>
>>>>
>>>> Andre Zapico
|
> I'm with Bob: #1918 (comment)
> Instead of chatting, let's come up with a concrete way of determining whether something is faster.
Yes, this is what stan-perf <https://github.com/SteveBronder/stan-perf> is specifically built for. It lets you use Google Benchmark with the Stan Math library and a branch to see how performance varies. You can see in the examples in that repo that I use multiple sizes of matrices, and you can also use Google Benchmark to benchmark a varying number of threads / matrix sizes.
> WRT maintenance, it's like 3 lines of code and some declaratives. Easy to maintain.
What goes in the if statement is the question. For instance, Eigen has ways to query information about the CPU to determine ballparks for whether it is worth dispatching to parallel versions of matrix multiplication. To do this well we would need something similar, and that is a lot of very hairy code. Then there is also the question of how many threads you should use for a given operation. The if statement will not just be if (N > some_number) -> parallel.
> What I am suggesting is we ignore MCMC for now, and just go with runtime at evaluating prob distributions.
For this we want to use Google Benchmark via stan-perf to test the exponential function performance directly.
> See what I'm saying? so there's a concrete gain as opposed to a convoluted research question?
I'm honestly rather confused about what you are looking to do. As I've said before, handling the edge cases around parallelization for simple unary, binary, etc. operations is actually pretty difficult; otherwise we would have done this quite a while ago. And IMO that level of parallelization is something that would be better to have in Eigen rather than Stan Math. To make the overhead cost of spinning up threads worth it, you would want to chain together many operations on that thread so that the computation is worth it (Amdahl's Law). That is the reason we have reduce_sum: it gives the user the ability to break an lpdf into smaller batches to compute on multiple threads. Doing that chunking automatically is a pretty large challenge. |
|
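To make the "what goes in the if statement" question concrete, here is a minimal sketch of a size-gated elementwise exp. It is not the code from this PR and not how Stan Math or Eigen dispatch work: it uses plain `std::thread` instead of TBB so that it stays self-contained, and `exp_size_gated`, `exp_range`, `min_parallel_size`, and `n_threads` are hypothetical names chosen for illustration. The cutoff is exactly the machine-dependent number discussed above, so it is passed in rather than hard-coded.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Apply std::exp to x[begin, end) and write the results into out.
inline void exp_range(const std::vector<double>& x, std::vector<double>& out,
                      std::size_t begin, std::size_t end) {
  for (std::size_t i = begin; i < end; ++i) {
    out[i] = std::exp(x[i]);
  }
}

// Hypothetical size-gated elementwise exp. `min_parallel_size` is an
// illustrative tuning knob, not a Stan Math setting.
inline std::vector<double> exp_size_gated(const std::vector<double>& x,
                                          std::size_t min_parallel_size,
                                          unsigned n_threads) {
  const std::size_t n = x.size();
  std::vector<double> out(n);
  if (n == 0 || n < min_parallel_size || n_threads <= 1) {
    exp_range(x, out, 0, n);  // serial path: no thread start-up cost
    return out;
  }
  // Never start more workers than there are elements.
  const std::size_t workers = std::min<std::size_t>(n_threads, n);
  const std::size_t chunk = (n + workers - 1) / workers;  // ceiling division
  std::vector<std::thread> pool;
  pool.reserve(workers);
  for (std::size_t begin = 0; begin < n; begin += chunk) {
    const std::size_t end = std::min(n, begin + chunk);
    // Each worker writes a disjoint slice of `out`, so no locking is needed.
    pool.emplace_back(exp_range, std::cref(x), std::ref(out), begin, end);
  }
  for (auto& t : pool) {
    t.join();
  }
  return out;
}
```

The sketch starts and joins fresh threads on every call, which is the worst case for overhead; a pooled scheduler such as TBB amortizes part of that cost, but the question of where to put the cutoff remains.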
I am not sure matrix operations are the best way to test whether
parallelization is effective in the math library. Is Cholesky decomp
parallelizable? No, it's recursive. Block diagonal Cholesky, sure, since
you can decompose each block in parallel. Is Gauss-Jordan elimination
parallelizable? I don't think so.
Re: confusion: I'm looking to throw threads at anything possible.
And not everything that's parallelizable requires a reduce_sum. It's
also possible to abstract that away from the user.
Re: how many threads: we can get an approximation via MC simulation, am I
right? This is the point.
Which PDFs and which functions have you tried parallelizing? Not sure how
deep the threads travel. But if a log likelihood evaluation per iteration
in an MCMC sampler has different parameter estimates every time (until
convergence), it's not really parallelizable.
I'm thinking every vectorized operation can be parallelized and we can
evaluate #threads through simulation, as a proxy, not a proof, and it can
be abstracted away from the user.
And again, matrix operations are not the best "benchmark" for
parallelization. I'm thinking of anything vectorized (i.e. performing the
same operation multiple times). You can't really parallelize something
recursive, no? You're just adding overhead and collecting threads, etc.
What all did you benchmark in your stan-perf report besides matrix
factorizations? I can't look at this right now; no computer.
Best,
Andre Zapico
|
|
I think it would be helpful if you responded inline with quotes, as I'm having a hard time following your messages here.
I don't understand how this is related to your PR's goal. The matrix multiply example in stan-perf is just an example. You can then add your own examples for parallelism.
There is a real cost to spinning up threads, both in terms of overhead compute and hardware resources. We need to be very mindful about this.
Many useful cases that we can do reverse mode on require the use of reduce_sum. As of now it is one of the few ways we know how to do reverse-mode autodiff in parallel. I think a simple place to start with your project is doing the Google Benchmark for the exponential function as you have it, with varying thread and vector counts.
No, we cannot. The things we care about are the literal CPU, how many cores it has, the L1 and L2 cache sizes, the size of the vector, whether it is ARM, x86, etc., and what SIMD is available. If this were a matter of doing a few simulations and if statements, I promise we would have done this already.
The reason we have
See my comment above about the different levers in play here. This is a large task and IMO not one that the Stan Math library wants to maintain.
The example in the stan-perf repo tests Struct of Array and Array of Struct matrices; it is not testing parallelism.
You can see the different benchmarks in the
I'm going to close this until we have a more definite idea of what we want to do. If you are interested in this, I think it would be better to start with an issue along with Google Benchmark code that shows the performance of your idea. FYI, this does not remove your code; it is still at |
|
I've already benchmarked it and answered several of your questions. I have
an issue open already. Your matrix multiplication benchmarks are in no way
comprehensive about answering whether parallelizing functions will scale.
And nothing I have built for this project, or any project, has ever needed
maintenance.
If you want to take a look at the "levers in play," you can take a look at
me benchmarking certain parallelization parameters (i.e. #threads, block
size) which will give some insight as to how the different levers perform.
You can run multiple threads on one core.
I am talking specifically about parallelism. So why are you pointing me to
this report?
And sure, if your "levers" are something you'd like to evaluate, please
itemize them.
I've already done some tests to show overhead of initiating threads.
What levers would you like to see pulled?
Best,
Andre Zapico
…On Thu, May 28, 2026, 2:56 PM Steve Bronder ***@***.***> wrote:
Closed #3314 <#3314>.
|
|
@Steve Bronder ***@***.***> thoughts?
I am not sure why you're talking about things other than parallelization
when that is the main topic I'm raising. It does not make sense.
Best,
Andre Zapico
|
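> I've already benchmarked it and answered several of your questions. I have an issue open already. Your matrix multiplication benchmarks are in no way comprehensive about answering whether parallelizing functions will scale.
Can you point me to them? I am not seeing Google benchmarks in this thread.
> And nothing I have built for this project, or any project, has ever needed maintenance.
Can you show me a file in Stan Math you wrote that does not later have other contributors? Looking at the git blame for the gp functions, they have been refactored and rewritten over the years. All code adds maintenance.
> If you want to take a look at the "levers in play," you can take a look at me benchmarking certain parallelization parameters (i.e. #threads, block size) which will give some insight as to how the different levers perform.
I want an actual Google benchmark.
> You can run multiple threads on one core.
> I am talking specifically about parallelism. So why are you pointing me to this report?
What report are you talking about? stan-perf is a repository with everything set up so that you can use Google Benchmark for building benchmark experiments. I feel like you are not reading or looking at the resources I am sending you.
> And sure, if your "levers" are something you'd like to evaluate, please itemize them.
> I've already done some tests to show overhead of initiating threads.
> What levers would you like to see pulled?
A good start would be experiments that check how performance relative to serial execution of the function changes as the size of a vector and the number of threads vary. You can use the stan-perf <https://github.com/SteveBronder/stan-perf> repository to set up your benchmarks and execute them. You can fork the stan-perf repository, write your benchmarks, make some nice graphs, and share your code and results.
I've directed you several times to the stan-perf repository to set up your benchmarking code. Is there something I'm missing as to why you do not go and build the benchmarks via that repo and run them? Using google/benchmark will give you consistent and shareable results. You can also output the results to CSV/JSON so you can make nice plots in R or Python. |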
|
Give me a clear ordered list of what you would like to see.
There have been minimal changes to the GP functions I've written. Yes, I am
aware code requires maintenance.
Go ahead and run ./runTests.py test/unit/math/prim/fun/exp_test both with
and without STAN_THREADS=TRUE in make/local, and you will see a huge
performance gain, as well as an example of Amdahl's law in practice.
Again, I'm not sure why you are referencing benchmarks that have nothing to
do with parallelization when this is what we are discussing.
Again, I am working on obtaining a functional machine for programming.
Best,
Andre Zapico
|
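Since both sides of this exchange invoke Amdahl's law, it may help to write it down next to a simple overhead model. The first line is the standard statement of the law; the rest is only a back-of-the-envelope sketch. The symbols $f$ (parallelizable fraction), $p$ (thread count), $t_e$ (time per elementwise evaluation), and $t_0(p)$ (cost of starting and joining $p$ threads) are assumptions introduced here, not quantities measured anywhere in this thread.

```latex
% Amdahl's law: speedup on p threads when a fraction f of the work parallelizes
S(p) = \frac{1}{(1 - f) + \dfrac{f}{p}}

% Assumed cost model for an elementwise operation on a vector of length N
T_1(N) = N\, t_e, \qquad T_p(N) = t_0(p) + \frac{N\, t_e}{p}

% Under this model the parallel version wins only if T_p(N) < T_1(N), i.e.
N > \frac{t_0(p)}{t_e \left(1 - \dfrac{1}{p}\right)}, \qquad p > 1
```

Under these assumptions the break-even vector length grows with the thread start-up cost and shrinks as each evaluation gets more expensive, which is one way to read both the small-vector slowdown and the large-vector speedup described in the thread.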
|
I'm very confused. Your benchmarks are very ad hoc and I'm asking for more formal benchmarks. I feel I'm not being heard. You can read the stan-perf repo and post clear benchmarks using the Google Benchmark suite provided in that repo. |
|
Give me a list of the parameters you'd like benchmarked. Keep in mind that
varying them in conjunction can affect the benchmarking results.
Can you provide an ordered, enumerated list of what you'd like to see
benchmarked, so we're not all wasting time?
Thanks!
Best,
Andre Zapico
|
|
I would want vectors from size 2 to 32,768 by powers of 2 (so 2, 4, 8, 16, ...). Then I would want threads from 1, 2, 4, ... up to the max threads on your computer. If you use stan-perf, Google Benchmark has the tooling set up for this. |
|
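The sweep requested above can be written down as an explicit grid. The following is a minimal, standard-library-only sketch of that grid plus a crude serial timing loop; it is a stand-in to show the shape of the experiment, not a replacement for Google Benchmark or the stan-perf setup. `BenchCase`, `powers_of_two`, `benchmark_grid`, and `time_serial_exp_ns` are hypothetical names, and the sketch assumes a thread count that is not a power of two is simply left out of the grid.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// One benchmark configuration: vector length and thread count.
struct BenchCase {
  std::size_t n;
  std::size_t threads;
};

// Powers of two from lo up to hi inclusive; lo is assumed to be a power of two.
inline std::vector<std::size_t> powers_of_two(std::size_t lo, std::size_t hi) {
  std::vector<std::size_t> out;
  if (lo == 0) {
    return out;  // avoid looping forever on 0 * 2
  }
  for (std::size_t v = lo; v <= hi; v *= 2) {
    out.push_back(v);
  }
  return out;
}

// Sizes 2, 4, ..., 32768 crossed with threads 1, 2, 4, ... <= max_threads.
inline std::vector<BenchCase> benchmark_grid(std::size_t max_threads) {
  std::vector<BenchCase> grid;
  for (std::size_t n : powers_of_two(2, 32768)) {
    for (std::size_t t : powers_of_two(1, max_threads)) {
      grid.push_back({n, t});
    }
  }
  return grid;
}

// Crude serial baseline: mean nanoseconds per pass over a length-n vector.
// The checksum is written through `sink` so the loop is not optimized away.
inline double time_serial_exp_ns(std::size_t n, int reps, double& sink) {
  std::vector<double> x(n, 0.5);
  double checksum = 0.0;
  const auto start = std::chrono::steady_clock::now();
  for (int r = 0; r < reps; ++r) {
    for (std::size_t i = 0; i < n; ++i) {
      checksum += std::exp(x[i]);
    }
  }
  const auto stop = std::chrono::steady_clock::now();
  sink = checksum;
  return std::chrono::duration<double, std::nano>(stop - start).count() / reps;
}
```

With `max_threads` taken from `std::thread::hardware_concurrency()`, each `BenchCase` would be timed once serially and once in parallel and reported as a ratio. A hand-rolled loop like this has none of the warm-up, repetition, or statistics handling that a benchmarking library provides, which is presumably part of why the reviewer asks for Google Benchmark results.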
And which parameters: threads, block size, randomly generated datasets?
Which PDFs?
If you run my tests I've done something similar.
Can you be more rigorous about what you would like?
Thanks!
Best,
Andre Zapico
|
|
Give me the columns and rows of a table you'd like to see and I'll benchmark it.
Columns: #threads, block size,
Rows: dataset size,
etc.
See what I'm saying?
Best,
Andre Zapico
|
|
I really don't see what you are saying here. Again, if you look at the example benchmarks in the stan-perf repo, it should be reasonable to fork the repo and copy the current benchmarks into a new folder so that you can run them over the serial and parallel exponential function. Your code only changes prim, so you only need to use doubles. |
|
I feel like you are not listening to what I have said here or reading the stan-perf repo. At this point it feels like we are talking in circles. If you have specific issues with getting the stan-perf repo running, ping me, but otherwise I have nothing more to add here. |
|
What exactly would you like to see evaluated? Be precise.
I feel you're being intentionally vague.
Which levers do you need pulled?
The optimization surface is probably multimodal, but we could use some
convex optimization via CVXPY or something similar to find a local max that
could definitely improve performance. It could be hidden from users so
there are no questions as to how to run something.
Give me a table to fill out with benchmarks.
Clear, or no?
|
|
You are way overthinking this. I want you to use the Google benchmark library to benchmark only the exponential function for the dimensions and threads I specified above. Nothing else. The stan-perf repo has what you need. |
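The grid being asked for (vector sizes 2 to 32,768 by powers of 2, threads 1, 2, 4, ... up to the machine maximum) can be written down as a table to fill in. The sketch below is a standard-library stand-in, not Google Benchmark and not the stan-perf setup; the names `powers_of_two`, `thread_counts`, `time_serial_exp_ns` and `print_grid` are hypothetical, and appending a non-power-of-two core count as the last thread setting is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <ostream>
#include <thread>
#include <vector>

// Powers of two from lo to hi inclusive, e.g. 2, 4, 8, ..., 32768.
std::vector<std::size_t> powers_of_two(std::size_t lo, std::size_t hi) {
  std::vector<std::size_t> out;
  for (std::size_t n = lo; n <= hi; n *= 2) {
    out.push_back(n);
  }
  return out;
}

// Thread counts 1, 2, 4, ... up to max_threads. Assumption: when max_threads
// is not a power of two it is appended as the final setting (12 -> 1 2 4 8 12).
std::vector<unsigned> thread_counts(unsigned max_threads) {
  max_threads = std::max(1u, max_threads);
  std::vector<unsigned> out;
  for (unsigned t = 1; t <= max_threads; t *= 2) {
    out.push_back(t);
  }
  if (out.back() != max_threads) {
    out.push_back(max_threads);
  }
  return out;
}

// Median wall time in nanoseconds of an elementwise std::exp over n doubles.
double time_serial_exp_ns(std::size_t n, int repeats = 25) {
  if (n == 0) {
    return 0.0;
  }
  repeats = std::max(repeats, 1);
  std::vector<double> x(n), y(n);
  for (std::size_t i = 0; i < n; ++i) {
    x[i] = 1e-3 * static_cast<double>(i);
  }
  std::vector<double> samples;
  for (int r = 0; r < repeats; ++r) {
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) {
      y[i] = std::exp(x[i]);
    }
    const auto t1 = std::chrono::steady_clock::now();
    volatile double sink = y[n - 1];  // read a result so the loop is not discarded
    (void)sink;
    samples.push_back(std::chrono::duration<double, std::nano>(t1 - t0).count());
  }
  std::nth_element(samples.begin(), samples.begin() + samples.size() / 2,
                   samples.end());
  return samples[samples.size() / 2];
}

// Writes one CSV row per (size, threads) cell. The serial timing is filled in;
// the parallel column is left empty for the threaded implementation under test.
void print_grid(std::ostream& os) {
  const unsigned hw = std::max(1u, std::thread::hardware_concurrency());
  os << "n,threads,serial_ns,parallel_ns\n";
  for (std::size_t n : powers_of_two(2, 32768)) {
    const double serial_ns = time_serial_exp_ns(n);
    for (unsigned t : thread_counts(hw)) {
      os << n << ',' << t << ',' << serial_ns << ",\n";
    }
  }
}
```

Calling `print_grid(std::cout)` (with `<iostream>` included) gives 15 sizes times the number of thread settings as rows, which is the whole experiment: one number per cell, exp only, doubles only.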
|
I don't think you're being specific enough. You've clearly negated things I
have shown and are adding additional barriers to contributing.
|
|
Sorry for the incorrect wording: I meant ignoring** things that I have shown.
You clearly haven't investigated my benchmarks or given any clear feedback
as to which "levers" you'd like pulled.
I'm offering my time for free to improve the performance of the math
library.
|
|
And you think principled benchmarks are overthinking?
|
Yes absolutely. I want you to use the Google benchmark library to do a microbenchmark of only the exponential function for the dimensions and threads I specified above. Nothing else. The stan-perf repo has what you need. |
|
You did not explicitly specify. Please specify in your next response.
Give me an "enum," or an ordered list. With this, we can come up with at
least a proxy that will give performance gains for users and that can be
abstracted away from them so they're not confused about the make/local
file.
Thank you.
I have already exhibited Amdahl's Law. Moreover, we're not investigating
whether threads travel through composite functions, etc.
I'm up for incorporating stan-perf into the Stan/math benchmarks pipeline. I've
done some YAML.
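For reference, Amdahl's law as it is usually stated. The symbols are introduced here and are not from the thread: $f$ is the fraction of the runtime that can be parallelized and $p$ is the number of threads.

```latex
S(p) = \frac{1}{(1 - f) + \dfrac{f}{p}},
\qquad
\lim_{p \to \infty} S(p) = \frac{1}{1 - f}
```

The law assumes a fixed problem size and no cost for starting or scheduling threads, so it gives an upper bound on speedup; it does not by itself describe the slowdown seen at small vector sizes, where that overhead dominates.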
|
|
@wds15, Sebastian, I forget your handle.
Thoughts? What columns/covariates would you like to see identified before
this moves further? I'm always up for cancelling a useless project, but this could
add major gains and it's passing all unit tests.
The version from two commits ago was the fastest. I can be more clear/specific with
benchmarks. I should be back on a computer tomorrow.
I was looking at #threads, block size, and also fixing and scaling N. I'm also
not sure how this will behave with composite functions (every single PDF).
I'm ignoring autodiff for now, but just threading the function evaluation could
potentially reduce execution time.
The tests pass for rev/mix/prim on the fastest implementation, and I'm
using directives, but if there was a serious merge there could be another
directory like the OpenCL GPU stuff. I'm around and watching this, and I'd maintain
it. I think it's like 7:30 Berlin time right now.
Thoughts on more rigorous evaluation?
And then doing something fancier, like threading rev, would be sweet, but I
don't like to spend time on projects before thoroughly evaluating
whether they are possible or not.
|
|
Alright, hold on. New PC. I am just doing basic stuff, browsing files. I have
12 cores and I'm using about 2500 threads, with literally nothing but OS updates
and browsing running. Still can't get into GitHub. But clearly the OS devs are
threading under the hood and masking it from users.
Let me come up with a concrete list of what's happening and see if you like
it. I think Steve had some good insight as to what's good to test? Again,
this is like one 5-line class. Easy to maintain, and I won't disappear.
Before, I just fixed the dataset size and varied #threads, block size, (something
else).
And then I scaled the dataset size, fixing block size and varying threads. So
there's this combinatorially increasing # of plots, which is why I didn't
feel like plotting it.
Once GitHub lets me back in, want me to give you a laundry list?
And I'm open to more people with operating system (OS) dev experience
chiming in.
But I think capitalizing on this could potentially add gains. And there's not
really much maintenance; I could get prim/rev/mix tests on Travis to pass
overnight. We just have to worry about stupid stuff like extra copies etc.,
and deleting unused stuff (on the stack?) after we exit the function. Not sure
if C++ has an automatic garbage collector like Python? You guys would know.
In C, sometimes people were deleting memory and sometimes not; there was not
a concrete rule.
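On the cleanup question: standard C++ has no garbage collector, but objects with automatic storage are destroyed when their scope exits, and owners such as `std::vector` and `std::unique_ptr` release their heap memory in their destructors. A minimal sketch, where `Tracked` and `peak_alive_inside_scope` are hypothetical names made up for this illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical helper type that counts how many instances are currently alive.
struct Tracked {
  static int alive;
  Tracked() { ++alive; }
  Tracked(const Tracked&) { ++alive; }
  ~Tracked() { --alive; }
};
int Tracked::alive = 0;

// Everything created here is released when the function returns:
// no explicit delete and no garbage collector involved.
int peak_alive_inside_scope(std::size_t n) {
  Tracked on_stack;                          // automatic storage
  std::vector<Tracked> in_vector(n);         // heap buffer owned by the vector
  auto owned = std::make_unique<Tracked>();  // heap object owned by unique_ptr
  return Tracked::alive;                     // 1 + n + 1 while still in scope
}
```

After `peak_alive_inside_scope` returns, `Tracked::alive` is back to zero. Only memory obtained with a raw `new` (or `malloc`) needs a matching manual `delete` (or `free`), which is the situation the C comment describes.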
Thoughts? I think Steve Bronder would have more insight as to what to test.
Again, happy to run benchmarks and then maybe add stan-perf to the
pipeline.
But I am also wondering whether threads travel through composite functions.
Simple stuff like that.
Then I could worry about rev mode.
Sound silly or no? IDK, I have a math background. Open to any feedback.
And I still can't get into GitHub; I am bothering their team. But yeah,
GitHub, what's going on?
Cheers everyone.
|
Summary
I wrote a class that contains an operator for exp, which allows us to use tbb to parallelize a for loop. At a lower number of observations the gain from parallelization is marginal, but at a higher number of observations parallelizing the for loop with tbb::parallel_for pays off: for example, at roughly 32,000 observations there seems to be a speedup at 4 threads that is sustained as we increase the size of the Container.
Tests
I tested for numerical accuracy, which checks out. Moreover, I did the following performance tests:
Side Effects
Yes. If we kick in threads too early, there's actually a slowdown in computing exp on a vector with a lower number of observations. Maybe it would be good to have a default minimum number of threads, or to have them kick in only when the dataset reaches a certain size. Moreover, this is just one function, so the result may be different when we have a composite function (Gaussian). I think this may be advantageous at a lower number of observations, but I have not evaluated this.
What I've done is add a directive that runs the multithreaded code only for vectors, and otherwise calls the original code (copy-pasted into the STAN_THREADS section) when exp is not threaded. I'd be open to a quick refactor if we wanted to set it up like OpenCL and have a threads directory under stan\math\prim.
Release notes
?
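The size threshold raised under Side Effects could look roughly like the sketch below. This is not the code in this PR: `std::thread` stands in for `tbb::parallel_for` so the example has no dependencies, and `ExpBody`, `exp_maybe_parallel` and `kMinParallelSize` are hypothetical names. The default cutoff of 32768 only echoes the rough observation above and would have to be measured per machine.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Loop body applied to a half-open index range [begin, end). With TBB the
// same loop would sit in a body that receives a tbb::blocked_range instead.
struct ExpBody {
  const double* in;
  double* out;
  void operator()(std::size_t begin, std::size_t end) const {
    for (std::size_t i = begin; i < end; ++i) {
      out[i] = std::exp(in[i]);
    }
  }
};

// Hypothetical cutoff; the right value is machine dependent.
constexpr std::size_t kMinParallelSize = 32768;

std::vector<double> exp_maybe_parallel(
    const std::vector<double>& x, unsigned n_threads,
    std::size_t min_parallel_size = kMinParallelSize) {
  const std::size_t n = x.size();
  std::vector<double> y(n);
  const ExpBody body{x.data(), y.data()};
  if (n_threads <= 1 || n < min_parallel_size) {
    body(0, n);  // serial path: small inputs do not pay for thread start-up
    return y;
  }
  const std::size_t chunk = (n + n_threads - 1) / n_threads;  // ceil(n / threads)
  std::vector<std::thread> workers;
  for (std::size_t begin = 0; begin < n; begin += chunk) {
    const std::size_t end = std::min(n, begin + chunk);
    workers.emplace_back(body, begin, end);  // each thread writes a disjoint slice
  }
  for (auto& w : workers) {
    w.join();
  }
  return y;
}
```

Both paths run the same `ExpBody`, so the serial and threaded results are identical element by element; the only thing the threshold changes is whether threads are started at all.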
Checklist
Copyright holder: (Andre Zapico, Likely LLC, 2026)
The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
- Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
- Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
the basic tests are passing
- ./runTests.py test/unit
- make test-headers
- make test-math-dependencies
- make doxygen
- make cpplint
the code is written in idiomatic C++ and changes are documented in the doxygen
the new changes are tested