Hello,
enabling OpenMP slows down the test suite on my machine significantly. Compiling with gcc 4.10.0 and -O3 on my i7-2760QM@2.40GHz I get the timings below (left = OpenMP enabled, right = disabled), using all 8 threads.

Am I doing something wrong here, or does anyone get a different picture? If not, I wonder: should we rather remove the existing #pragmas in the fd and tree parts of the library again?

Thanks a lot
Peter

Testing Barone-Adesi and Whaley approximation for American options...
Testing Bjerksund and Stensland approximation for American options...
Testing Ju approximation for American options...
Testing finite-difference engine for American options...
Testing finite-differences American option greeks...
Testing finite-differences shout option greeks...

Tests completed in 20.25 s / Tests completed in 1.63 s

Testing analytic continuous geometric average-price Asians...
Testing analytic continuous geometric average-price Asian greeks...
Testing analytic discrete geometric average-price Asians...
Testing analytic discrete geometric average-strike Asians...
Testing Monte Carlo discrete geometric average-price Asians...
Testing Monte Carlo discrete arithmetic average-price Asians...
Testing Monte Carlo discrete arithmetic average-strike Asians...
Testing discrete-averaging geometric Asian greeks...
Testing use of past fixings in Asian options...

Tests completed in 19.28 s / Tests completed in 6.16 s

Testing barrier options against Haug's values...
Testing barrier options against Babsiri's values...
Testing barrier options against Beaglehole's values...
Testing local volatility and Heston FD engines for barrier options...

Tests completed in 13.86 s / Tests completed in 2.70 s

Testing dividend European option values with no dividends...
Testing dividend European option with a dividend on today's date...
Testing dividend European option greeks...
Testing finite-difference dividend European option values...
Testing finite-differences dividend European option greeks...
Testing finite-differences dividend American option greeks...
Testing degenerate finite-differences dividend European option...
Testing degenerate finite-differences dividend American option...

Tests completed in 25.06 s / Tests completed in 3.55 s

Testing FDM with barrier option for Heston model vs Black-Scholes model...
Testing FDM with barrier option in Heston model...
Testing FDM with American option in Heston model...
Testing FDM Heston for Ikonen and Toivanen tests...
Testing FDM Heston with Black Scholes model...
Testing FDM with European option with dividends in Heston model...
Testing FDM Heston convergence...

Tests completed in 3 m 31.86 s / Tests completed in 44.90 s

Testing indexing of a linear operator...
Testing uniform grid mesher...
Testing application of first-derivatives map...
Testing application of second-derivatives map...
Testing application of second-order mixed-derivatives map...
Testing triple-band map solution...
Testing FDM with barrier option in Heston model...
Testing FDM with American option in Heston model...
Testing FDM with express certificate in Heston model...
Testing FDM with Heston Hull-White model...
Testing bi-conjugated gradient stabilized algorithm with Heston operator...
Testing Crank-Nicolson with initial implicit damping steps for a digital option...
Testing SparseMatrixReference type...
Testing assignment to zero in sparse matrix...

Tests completed in 46.73 s / Tests completed in 6.63 s

------------------------------------------------------------------------------
HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions
Find What Matters Most in Your Big Data with HPCC Systems
Open Source. Fast. Scalable. Simple. Ideal for Dirty Data.
Leverages Graph Analysis for Fast Processing & Easy Data Exploration
http://p.sf.net/sfu/hpccsystems
_______________________________________________
QuantLib-dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/quantlib-dev
Hi Peter,
I share your experience: OpenMP slows down QL on my hardware as well. OpenMP might pay off for very large problems, but for "normal" problem sizes the overhead kills the speed-up. I'd rather remove it.

Regards
Klaus
On Saturday, June 14, 2014 08:10:55 PM Peter Caspers wrote:
> [...]
Yes, the timing might be off. I suspect that the Boost timer is reporting the total CPU time, that is, the sum of the time spent on each CPU. On my box, if I run the BermudanSwaption example with OpenMP enabled, it outputs:

Run completed in 2 m 35 s

but if I call it through "time", I get an output like:

real    1m19.767s
user    2m34.183s
sys     0m0.538s

that is, total CPU time 2m34s, but real time 1m19s. Being the untrusting individual that I am, I also timed it with a stopwatch. The elapsed time is actually 1m19s :)

This said, I still see a little slowdown in the test cases Peter listed. My times are:

AmericanOptionTest: disabled 2.4s, enabled 3.4s (real time)
AsianOptionTest: disabled 10.6s, enabled 10.4s
BarrierOptionTest: disabled 4.9s, enabled 6.1s
DividendOptionTest: disabled 5.1s, enabled 6.5s
FdHestonTest: disabled 73.4s, enabled 76.8s
FdmLinearOpTest: disabled 11.4s, enabled 11.6s

Not much, but a bit slower anyway. I've only got 2 CPUs, though (and I compiled with -O2). Peter, what do you get on your 8 CPUs if you run the cases via "time"?

Luigi

On Sun, Jun 15, 2014 at 3:40 PM, Joseph Wang <[hidden email]> wrote:
> That's quite odd, since OpenMP should not be causing such huge slowdowns.
>
> Since by default the pragmas are not compiled in, I'd rather keep them
> there.
>
> Also, is there any possibility that the timing code is off?

--
<https://implementingquantlib.blogspot.com>
<https://twitter.com/lballabio>
Hi,
Have you tried using only the 4 physical cores of your CPU? I don't use OpenMP, but I do use Boost threads, and hyperthreading does very weird things; one is that beyond 4 threads (in this case) the scaling stops being linear, which makes sense. Luigi, your 2 CPUs are physical, right?

Just a shot.
Best

----- Original Message -----
> Yes, the timing might be off. [...]
Oh yes, my timings below are total CPU time rather than wall-clock time (I usually measure the latter by just counting seconds in my head...). That was unfair, sorry! With the time command I get (for the AmericanOptionTest):

g++ -O3 -fopenmp

OMP_NUM_THREADS=1  real = 1.925s
OMP_NUM_THREADS=2  real = 1.468s
OMP_NUM_THREADS=3  real = 1.590s
OMP_NUM_THREADS=4  real = 1.647s
OMP_NUM_THREADS=5  real = 1.780s
OMP_NUM_THREADS=6  real = 1.838s
OMP_NUM_THREADS=7  real = 2.081s
OMP_NUM_THREADS=8  real = 2.282s

g++ -O3

real = 1.638s

Still, the point is the same imo. With 8 cores I'd expect maybe a speed-up factor of 4 to 6. What we instead see is something around 1 (often below 1, it seems), so effectively all the additional CPU time is eaten up by the overhead of multiple threads. That's not worth it, is it? I haven't tried many optimizations with OMP yet, but what I see in "good" cases are the factors of 4-6 above. I wouldn't parallelize for much below that.

best regards
Peter

On 15 June 2014 17:18, <[hidden email]> wrote:
> Hi,
> Have you tried to use only the 4 physical threads of your cpu? [...]
Is there a chance of the tests being too small? I remember that many years ago (during my Algorithmics days) we used to make tests as big as the client's real portfolio in order to give server-buying advice (mainly based on the number of processors, due to scenario valuation).

Would the speed-up factors change significantly if the base scenario ran for, let's say, 15 minutes? It would make the fixed costs more negligible.

Regards,

_____________________
Piter Dias

> Date: Sun, 15 Jun 2014 20:11:28 +0200
> Subject: Re: [Quantlib-dev] OpenMP - current usage in ql
>
> oh yes, my timings below are total CPU time rather than wall clock
> time [...]
There's a reason why it's "off" by default. :-) :-(

There's not that much parallelization going on in the PDE and tree code. OpenMP is used mainly for copying arrays, so you should see some modest improvement if you are working with very large arrays, but nowhere close to a factor of N for N cores.

The other project I'm working on: there are a ton of HK people (including my wife) who trade warrants and callable bull/bear certificates, and most of them don't have access to any sort of analytics. The reason is that brokers either don't care whether their customers have access to analytics, or actually don't want clients with analytics, since the brokers are taking the other side of the trade and want their clients to lose money.

On Mon, Jun 16, 2014 at 4:20 AM, Piter Dias <[hidden email]> wrote:
Hello,

> The problem is that the big speedups are for things like Monte Carlo,
> which are "ridiculously parallel." The trouble with MC is that once
> you parallelize, the random number generator gives you different
> answers, and it becomes impossible to test; you also have the
> possibility of very subtle bugs with RNG correlations across
> different processors. Getting QuantLib to produce consistent
> RNGs in multi-core MC turns out to be a non-trivial project.

I have done this with threads, and it's not that difficult to avoid those problems. But most (if not all) of the generators in the library are not suitable for it: not this version of the Mersenne Twister, nor the distribution generators, since most are rejection algorithms. I do get exactly the same result figures with one or N threads. Sobol is ok but is limited for distribution mapping; what I do is wrap an interface around TINA's MT generator, but then you need to link against that. The speed-up is practically linear in the number of CPUs, at least in the context of the problem I use it for.

> Alternatively, the algorithms used for PDEs in QuantLib, particularly
> the ones relating to tridiagonal matrix operations, turn out to be
> terrible for parallel computing. There could be better speed-ups for
> PDEs with different algorithms suited to parallel systems. Also, if
> you want to parallelize, you want to use an explicit PDE scheme
> rather than an implicit one.
>
> The two projects that I can think of are:
>
> 1) getting MC working for OpenMP, or
> 2) putting in better parallel algorithms for PDEs.

Back in the tokamak dark ages people used to parallelize implicit PDE solvers by breaking the domain and solving the pieces concurrently (rather than working on parallelizing the algebraic solver); all the subtlety was in sticking them back together on each time step. That was a long time ago and I can't remember the details.

Best
Trees seem to fare better. On the BermudanSwaption example, the
elapsed time does halve on two cores (more or less). Peter, what do you get on 4 or 8? Also, there might be other factors that enter the equation (the number of cache lines, for example?)

Luigi

On Sun, Jun 15, 2014 at 10:20 PM, Piter Dias <[hidden email]> wrote:
> Is there a chance of the test being too small? I remember that many
> years ago (during my Algorithmics days) we used to make tests as big
> as the client's real portfolio in order to give server-buying advice
> (mainly based on the number of processors, due to scenario valuation).
>
> Would the speed-up factors change significantly if the base scenario
> ran for, let's say, 15 minutes? It would make the fixed costs more
> negligible.
>
> Regards,
> Piter Dias
> www.piterdias.com
>
>> Date: Sun, 15 Jun 2014 20:11:28 +0200
>> Subject: Re: [Quantlib-dev] OpenMP - current usage in ql
>>
>> Oh yes, my timings below are total CPU time rather than wall-clock
>> time (I usually measure the latter by just counting seconds in my
>> head...). That was unfair, sorry! With the time command I get (for
>> the AmericanOptionTest):
>>
>> g++ -O3 -fopenmp
>>
>> OMP_NUM_THREADS=1 real = 1.925s
>> OMP_NUM_THREADS=2 real = 1.468s
>> OMP_NUM_THREADS=3 real = 1.590s
>> OMP_NUM_THREADS=4 real = 1.647s
>> OMP_NUM_THREADS=5 real = 1.780s
>> OMP_NUM_THREADS=6 real = 1.838s
>> OMP_NUM_THREADS=7 real = 2.081s
>> OMP_NUM_THREADS=8 real = 2.282s
>>
>> g++ -O3
>>
>> real = 1.638s
>>
>> Still, the point is the same, IMO. With 8 cores I'd expect maybe a
>> speed-up factor of 4 to 6. What we see instead is something around 1
>> (often below 1, it seems), so effectively all the additional CPU
>> time is eaten up by the overhead of multiple threads. That's not
>> worth it, is it? I haven't tried many optimizations with OMP yet,
>> but what I see in "good" cases is the 4-6 above. I wouldn't
>> parallelize for much less than that.
>>
>> best regards
>> Peter
>>
>> On 15 June 2014 17:18, <[hidden email]> wrote:
>> > Hi,
>> > Have you tried using only the 4 physical threads of your CPU? I
>> > don't use OpenMP, but I use Boost threads, and hyperthreading does
>> > very weird things; one is that above 4 threads (in this case) the
>> > scaling stops being linear, which makes sense. Luigi, your 2 CPUs
>> > are physical, right? Just a shot.
>> > Best
>> >
>> > ----- Original Message -----
>> >> Yes, the timing might be off. I suspect that the Boost timer is
>> >> reporting the total CPU time, that is, the sum of the actual time
>> >> per CPU. On my box, if I run the BermudanSwaption example with
>> >> OpenMP enabled, it outputs:
>> >>
>> >> Run completed in 2 m 35 s
>> >>
>> >> but if I call it through "time", I get an output like:
>> >>
>> >> real 1m19.767s
>> >> user 2m34.183s
>> >> sys  0m0.538s
>> >>
>> >> that is, total CPU time 2m34s, but real time 1m19s. Being the
>> >> untrusting individual that I am, I also timed it with a stopwatch.
>> >> The elapsed time is actually 1m19s :)
>> >>
>> >> This said, I still see a little slowdown in the test cases Peter
>> >> listed. My times are:
>> >>
>> >> AmericanOptionTest: disabled 2.4s, enabled 3.4s (real time)
>> >> AsianOptionTest:    disabled 10.6s, enabled 10.4s
>> >> BarrierOptionTest:  disabled 4.9s, enabled 6.1s
>> >> DividendOptionTest: disabled 5.1s, enabled 6.5s
>> >> FdHestonTest:       disabled 73.4s, enabled 76.8s
>> >> FdmLinearOpTest:    disabled 11.4s, enabled 11.6s
>> >>
>> >> Not much, but a bit slower anyway. I've only got 2 CPUs, though
>> >> (and I compiled with -O2). Peter, what do you get on your 8 CPUs
>> >> if you run the cases via "time"?
>> >>
>> >> Luigi
>> >>
>> >> On Sun, Jun 15, 2014 at 3:40 PM, Joseph Wang <[hidden email]> wrote:
>> >> > That's quite odd, since OpenMP should not be causing such huge
>> >> > slowdowns.
>> >> >
>> >> > Since by default the items are not compiled, I'd rather keep
>> >> > the pragmas there.
>> >> >
>> >> > Also, is there any possibility that the timing code is off?

--
<https://implementingquantlib.blogspot.com>
<https://twitter.com/lballabio>

_______________________________________________
QuantLib-dev mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/quantlib-dev
yes, this example scales well up to 6 cores:
without omp: 1m20s
threads=2, real=0m42s
threads=4, real=0m36s
threads=5, real=0m30s
threads=6, real=0m27s
threads=7, real=0m40s
threads=8, real=2m53s

Peter

On 16 June 2014 09:45, Luigi Ballabio <[hidden email]> wrote:
> Trees seem to fare better. On the BermudanSwaption example, the
> elapsed time does halve on two cores (more or less). Peter, what do
> you get on 4 or 8?
> Also, there might be other factors that enter the equation (the
> number of cache lines, for example?)
>
> Luigi
Hi,
I did some more tests. First of all, I have to correct my numbers below for 7 and 8 threads: when I rerun the example I get ~26s for both. I don't know what went wrong the last time. Anyway, I would say the parallel for in lattice.hpp is indeed useful, giving a speed-up factor of 4.6 on 8 cores in this example; that's fine.

Next I looked at the FDM code again in more detail. In triplebandlinearop.cpp and ninepointlinearop.cpp, all the omp pragmas are applied to loops of this kind:

    for (Size i=0; i < size; ++i) {
        diag[i]  = y_diag[i];
        lower[i] = y_lower[i];
        upper[i] = y_upper[i];
    }

which I guess are already very well optimized by the compiler on one thread, so multithreading would only make sense for very big sizes (playing around with loops of similar complexity in toy examples, I see a speed-up only for loop sizes around 1E+8 and bigger). Indeed, disabling the parallel for pragmas in the operator classes does not change the performance of the test cases filtered by --run_test='*/*/*Fd*', see below (only the overhead is avoided, which seems desirable). It seems that these loops are all vectorized by the compiler without having to do anything, because I get the same running times when adding #pragma omp simd explicitly. I am not sure about parallelevolver.hpp and stepcondition.hpp, but at least I don't see any benefit in the Fd test cases or in the BermudanSwaption example (which I think also uses them(?)).
all #pragma omp enabled:
  8 threads: real 1m29.793s  user 8m15.733s
  2 threads: real 1m13.676s  user 1m56.217s

disable triplebandlinearop.cpp:
  8 threads: real 1m31.091s  user 6m47.130s
  2 threads: real 1m15.548s  user 1m43.742s

disable triplebandlinearop.cpp, ninepointlinearop.cpp:
  8 threads: real 1m18.263s  user 1m56.950s
  2 threads: real 1m15.677s  user 1m16.592s

disable triplebandlinearop.cpp, ninepointlinearop.cpp, parallelevolver.hpp, stepcondition.hpp:
  real 1m14.468s  user 1m11.959s

Peter

On 16 June 2014 12:52, Peter Caspers <[hidden email]> wrote:
> yes, this example scales well up to 6 cores:
>
> without omp: 1m20s
> threads=2, real=0m42s
> threads=4, real=0m36s
> threads=5, real=0m30s
> threads=6, real=0m27s
> threads=7, real=0m40s
> threads=8, real=2m53s
>
> Peter
Thanks for the note. I'll take a look at the code some time next week. Since the loops are being vectorized anyway, it looks like there isn't any benefit to parallelization, and I'll see about taking the omp pragmas out.

On Thu, Jun 19, 2014 at 11:47 PM, Peter Caspers <[hidden email]> wrote:
> Hi,