
Re: OpenMP - current usage in ql

Posted by Joseph Wang-4 on Jun 19, 2014; 10:26pm
URL: http://quantlib.414.s1.nabble.com/OpenMP-current-usage-in-ql-tp15458p15520.html

Thanks for the note. I'll take a look at the code some time next week. Since the loops are already being vectorized, it looks like there isn't any benefit to parallelization, so I'll see about taking the omp pragmas out.


On Thu, Jun 19, 2014 at 11:47 PM, Peter Caspers <[hidden email]> wrote:
Hi,

I did some more tests. First of all, I have to correct my numbers below
for 7 and 8 threads: when I rerun the example I get ~26s for both. I
don't know what went wrong the last time. Anyway, I would say the
parallel for in lattice.hpp is indeed useful, giving a speed-up factor
of 4.6 on 8 cores in this example; that's fine.

Next I looked at the FDM code again in more detail. In
triplebandlinearop.cpp and ninepointlinearop.cpp all omp pragmas are
applied to loops of this kind

#pragma omp parallel for
for (Size i=0; i < size; ++i) {
    diag[i]  = y_diag[i];
    lower[i] = y_lower[i];
    upper[i] = y_upper[i];
}

which I guess are already optimized very well by the compiler on one
thread, so multithreading would only make sense for very big sizes
(playing around with toy examples with loops of similar complexity, I
see a speed-up only with loop sizes around 1E+8 and bigger). Indeed,
disabling the parallel for pragmas in the operator classes does not
change the performance of the test cases filtered by
--run_test='*/*/*Fd*', see below (only the overhead is avoided, which
seems desirable). It seems that these loops are all vectorized by the
compiler without having to do anything, because I am getting the same
running times when adding #pragma omp simd explicitly.

I am not sure about parallelevolver.hpp and stepcondition.hpp, but at
least I don't see any benefit in the Fd test cases or in the
BermudanSwaption example (which I think also uses them).

all #pragma omp enabled

8 threads real    1m29.793s user    8m15.733s
2 threads real    1m13.676s user    1m56.217s

disable triplebandlinearop.cpp

8 threads real    1m31.091s user    6m47.130s
2 threads real    1m15.548s user    1m43.742s

disable triplebandlinearop.cpp, ninepointlinearop.cpp

8 threads real    1m18.263s user    1m56.950s
2 threads real    1m15.677s user    1m16.592s

disable triplebandlinearop.cpp, ninepointlinearop.cpp,
parallelevolver.hpp, stepcondition.hpp

real    1m14.468s user    1m11.959s

Peter

On 16 June 2014 12:52, Peter Caspers <[hidden email]> wrote:
> yes, this example scales well up to 6 cores:
>
> without omp 1m20s
> threads=2, real=0m42s
> threads=4, real=0m36s
> threads=5, real=0m30s
> threads=6, real=0m27s
> threads=7, real=0m40s
> threads=8, real=2m53s
>
> Peter
>
>
> On 16 June 2014 09:45, Luigi Ballabio <[hidden email]> wrote:
>> Trees seem to fare better. On the BermudanSwaption example, the
>> elapsed time does halve on two cores (more or less). Peter, what do
>> you get on 4 or 8?
>> Also, there might be other factors that enter the equation (number of
>> cache lines, for example?)
>>
>> Luigi
>>
>>
>> On Sun, Jun 15, 2014 at 10:20 PM, Piter Dias <[hidden email]> wrote:
>>> Is there a chance of the test being too small? I remember that many years
>>> ago (during my Algorithmics days) we used to make the tests as big as the
>>> client's real portfolio in order to give server-buying advice (mainly
>>> based on the number of processors, due to scenario valuation).
>>>
>>> Would the speed-up factors change significantly if the base scenario ran
>>> for, let's say, 15 minutes? It would make the fixed costs more negligible.
>>>
>>> Regards,
>>>
>>> _____________________
>>> Piter Dias
>>> [hidden email]
>>> www.piterdias.com
>>>
>>>
>>>
>>>> Date: Sun, 15 Jun 2014 20:11:28 +0200
>>>> From: [hidden email]
>>>> To: [hidden email]
>>>> CC: [hidden email]; [hidden email];
>>>> [hidden email]
>>>> Subject: Re: [Quantlib-dev] OpenMP - current usage in ql
>>>
>>>>
>>>> oh yes, my timings below are total CPU time rather than wall-clock
>>>> time (I usually measure the latter by just counting seconds in my
>>>> head...). That was unfair, sorry! With the time command I get (for
>>>> the AmericanOptionTest)
>>>>
>>>> g++ -O3 -fopenmp
>>>>
>>>> OMP_NUM_THREADS=1 real = 1.925s
>>>> OMP_NUM_THREADS=2 real = 1.468s
>>>> OMP_NUM_THREADS=3 real = 1.590s
>>>> OMP_NUM_THREADS=4 real = 1.647s
>>>> OMP_NUM_THREADS=5 real = 1.780s
>>>> OMP_NUM_THREADS=6 real = 1.838s
>>>> OMP_NUM_THREADS=7 real = 2.081s
>>>> OMP_NUM_THREADS=8 real = 2.282s
>>>>
>>>> g++ -O3
>>>>
>>>> real = 1.638s
>>>>
>>>> still, the point is the same imo. With 8 cores I'd expect maybe a
>>>> speed-up factor of 4 to 6. What we instead see is something around 1
>>>> (often below 1, as it seems), so effectively all the additional CPU
>>>> time is eaten up by the overhead of multiple threads. That's not
>>>> worth it, is it? I haven't tried many optimizations with omp yet, but
>>>> what I see in "good" cases are the factors of 4-6 above. I wouldn't
>>>> parallelize for much below that.
>>>>
>>>> best regards
>>>> Peter
>>>>
>>>> On 15 June 2014 17:18, <[hidden email]> wrote:
>>>> > Hi,
>>>> > Have you tried to use only the 4 physical threads of your CPU? I don't
>>>> > use OpenMP, but I use boost threads, and hyperthreading does very weird
>>>> > things; one is that scaling stops being linear over 4 threads (in this
>>>> > case), which makes sense. Luigi, your 2 CPUs are physical, right?
>>>> > Just a shot.
>>>> > Best
>>>> >
>>>> >
>>>> > ----- Original Message -----
>>>> >> Yes, the timing might be off. I suspect that the Boost timer is
>>>> >> reporting the total CPU time, that is, the sum of the actual times
>>>> >> across CPUs. On my box, if I run the BermudanSwaption example with
>>>> >> OpenMP enabled, it outputs:
>>>> >>
>>>> >> Run completed in 2 m 35 s
>>>> >>
>>>> >> but if I call it through "time", I get an output like:
>>>> >>
>>>> >> real 1m19.767s
>>>> >> user 2m34.183s
>>>> >> sys 0m0.538s
>>>> >>
>>>> >> that is, total CPU time 2m34s, but real time 1m19s. Being the
>>>> >> untrusting individual that I am, I also timed it with a stopwatch.
>>>> >> The
>>>> >> elapsed time is actually 1m19s :)
>>>> >>
>>>> >> This said, I still see a little slowdown in the test cases Peter
>>>> >> listed. My times are:
>>>> >>
>>>> >> AmericanOptionTest: disabled 2.4s, enabled 3.4s (real time)
>>>> >> AsianOptionTest: disabled 10.6s, enabled 10.4s
>>>> >> BarrierOptionTest: disabled 4.9s, enabled 6.1s
>>>> >> DividendOptionTest: disabled 5.1s, enabled 6.5s
>>>> >> FdHestonTest: disabled 73.4s, enabled 76.8s
>>>> >> FdmLinearOpTest: disabled 11.4s, enabled 11.6s
>>>> >>
>>>> >> Not much, but a bit slower anyway. I've only got 2 CPUs though (and I
>>>> >> compiled with -O2). Peter, what do you get on your 8 CPUs if you run
>>>> >> the cases via "time"?
>>>> >>
>>>> >> Luigi
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Sun, Jun 15, 2014 at 3:40 PM, Joseph Wang <[hidden email]>
>>>> >> wrote:
>>>> >> >
>>>> >> > That's quite odd, since OpenMP should not be causing such huge
>>>> >> > slowdowns.
>>>> >> >
>>>> >> > Since by default the pragmas are not compiled in, I'd rather keep
>>>> >> > them there.
>>>> >> >
>>>> >> > Also, is there any possibility that the timing code is off?
>>>> >> >
>>>> >> >
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> <https://implementingquantlib.blogspot.com>
>>>> >> <https://twitter.com/lballabio>
>>>> >>
>>>> >
>>>>
>>
>>
>>
>> --
>> <https://implementingquantlib.blogspot.com>
>> <https://twitter.com/lballabio>


_______________________________________________
QuantLib-dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/quantlib-dev