Dear All,
so I finally made some progress on integrating CUDA and QuantLib. In particular, I have developed two classes:

MarketModelEvolverLMMPC_CUDA

and

SobolCudaBrownianGenerator

The first of these does LMM path generation on the GPU in the spot measure using predictor-corrector and can be fitted directly into the MarketModels code. The second generates a multi-dimensional Brownian motion using Sobol and Brownian bridging, and the paths can then be used in any application.

Typical timing to price 32 caplets (32 rates, 32 steps, 32-factor model, 524272 paths):

using the CUDA evolver: 32 seconds, of which 7 seconds on path generation and 6 seconds on transferring paths to the CPU
using the QL evolver: 135 seconds
using the QL evolver with SobolCudaBrownianGenerator: 79 seconds.

The code can be obtained via the subversion repository at kooderive.sourceforge.net

Clearly, much greater speed-ups are possible but will involve more code redesign to avoid transferring complete paths from GPU to CPU.

regards

Mark

--
Assoc Prof Mark Joshi
Centre for Actuarial Studies
University of Melbourne
My website is www.markjoshi.com
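For readers new to the predictor-corrector scheme mentioned above, here is a minimal single-factor CPU sketch of one spot-measure evolution step. It is only an illustration under simplifying assumptions (one driving factor, one normal draw per step, illustrative names throughout); the actual Kooderive kernels are multi-factor and run on the GPU.

    // One predictor-corrector log-Euler step for the LMM in the spot measure.
    // Illustrative single-factor CPU sketch only -- not the Kooderive code.
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Spot-measure drift of log F_i over the step: sum over alive rates j <= i.
    double spotDrift(const std::vector<double>& F, const std::vector<double>& sigma,
                     const std::vector<double>& tau, std::size_t alive, std::size_t i) {
        double mu = 0.0;
        for (std::size_t j = alive; j <= i; ++j)
            mu += sigma[i] * sigma[j] * tau[j] * F[j] / (1.0 + tau[j] * F[j]);
        return mu;
    }

    // Evolve all alive forward rates over one step of length dt, given one N(0,1) draw z.
    void evolveStepPC(std::vector<double>& F, const std::vector<double>& sigma,
                      const std::vector<double>& tau, std::size_t alive,
                      double dt, double z) {
        const std::vector<double> Fold = F;
        std::vector<double> Fpred(F.size()), muOld(F.size(), 0.0);
        const double sqrtDt = std::sqrt(dt);
        // predictor: drifts frozen at the start of the step
        for (std::size_t i = alive; i < F.size(); ++i) {
            muOld[i] = spotDrift(Fold, sigma, tau, alive, i);
            Fpred[i] = Fold[i] * std::exp((muOld[i] - 0.5 * sigma[i] * sigma[i]) * dt
                                          + sigma[i] * sqrtDt * z);
        }
        // corrector: average the initial drift with the drift at the predicted endpoint
        for (std::size_t i = alive; i < F.size(); ++i) {
            const double muNew = spotDrift(Fpred, sigma, tau, alive, i);
            F[i] = Fold[i] * std::exp((0.5 * (muOld[i] + muNew) - 0.5 * sigma[i] * sigma[i]) * dt
                                      + sigma[i] * sqrtDt * z);
        }
    }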
Hi Mark,
I'm also using a CUDA Sobol generator based on Mike Giles' version. IMHO the stride in lines 94 and 245,

    stride = gridDim.x * blockDim.x;

should be a power of two. blockDim.x is equal to threadsperblock, which is 64. I'm also ensuring that dimGrid.x is a power of two by setting

    dimGrid.x = 1;
    int n = 1 + 31 / nDimensions;
    while (n >>= 1) dimGrid.x <<= 1;

at line 338ff. (As nDimensions is usually larger than 31 for your problems, this might not be an issue for your examples.)

cheers
Klaus
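Klaus's while-loop rounds the block count down to a power of two; a small helper makes the intent explicit. This is only a sketch of the same idea with illustrative names, not the exact Kooderive/Giles launch code.

    // Largest power of two <= n (assumes n >= 1).
    inline unsigned int roundDownPow2(unsigned int n) {
        unsigned int p = 1;
        while (n >>= 1) p <<= 1;
        return p;
    }

    // launch sketch: with threadsperblock already a power of two (64 here),
    // a power-of-two block count keeps the stride gridDim.x * blockDim.x
    // a power of two as well.
    // dim3 dimBlock(64);
    // dim3 dimGrid(roundDownPow2(1 + 31 / nDimensions));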
Dear Klaus,
thanks for your comments. I haven't done much playing with this aspect, since this is not where my bottlenecks are at the moment.

NB we now have a working skip method in the CUDA Sobol, so you can call it multiple times and get the next set of paths each time. We also skip the first path of zeros.

regards

Mark
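To show the idea behind such a skip method without the CUDA machinery, here is a self-contained one-dimensional sketch: the n-th draw of the first Sobol dimension can be computed directly from the Gray code of n, so a new call can jump straight to the first unused index instead of replaying the sequence. This is only an illustration, not the Kooderive implementation.

    #include <cstdint>

    // 32-bit Sobol integer for dimension 0 at index n; index 0 gives the
    // all-zeros point that the generator skips.
    inline std::uint32_t sobolDim0At(std::uint32_t n) {
        const std::uint32_t gray = n ^ (n >> 1);
        std::uint32_t x = 0;
        for (unsigned bit = 0; bit < 32; ++bit)
            if (gray & (1u << bit))
                x ^= 1u << (31 - bit);      // direction integers for dimension 0
        return x;
    }

    // Uniform in [0,1): skipping the zero point just means starting at index 1.
    inline double sobolDim0Uniform(std::uint32_t n) {
        return sobolDim0At(n) * (1.0 / 4294967296.0);
    }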
Dear Mark,
Dear QuantLib users,

The Kooderive project looks very interesting. I'm curious about the specs of the GPU/CPU the tests were run on. What kind of floating-point numbers were used, double or single?

Currently I am working on a multicore implementation of the LMM too. One problem with many-factor models is that the Longstaff-Schwartz method requires a huge amount of memory to store paths. I am considering using floats to save space and for a better speedup with SSE2, but I am not sure about the numerical error. Do you think that single-precision accuracy is enough in most cases?

How would Longstaff-Schwartz be implemented on a GPU? Is there any scalable implementation of the regression part?

Also, why does the CUDA evolver need to transfer paths to the CPU? Can't the GPU price all paths and simply return the results?

With the CUDA evolver, 6 seconds were spent transferring paths. Does that mean that without transferring paths the total time can be reduced from 32 to 26 seconds?

Regards,
Kakhkhor Abdijalilov.
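As a rough sense of scale (back-of-the-envelope numbers, not figures from this thread): storing full paths for 10^6 paths x 32 steps x 32 rates takes about 10^6 * 32 * 32 * 4 bytes, roughly 4.1 GB, in single precision, and twice that in double, which is why halving the footprint matters so much for Longstaff-Schwartz path storage.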
Dear Kakhkhor,
all tests were done with floats on the GPU -- these seem to be more than accurate enough.

I have not yet addressed the issues with LS on the GPU!

The project is incremental. Phase 1 was to do the paths on the GPU and the rest on the CPU, hence I had to transfer the paths to the CPU. The main advantage of this approach is that you get all the QuantLib functionality.

Phase 2 will be to price on the GPU and will not require path transfer. I am hoping that the total time for this will be less than 10 seconds! I am about to test the code for this. The main trickiness was in how to define the product in a sufficiently generic way that it was not necessary to do a huge amount of recoding for every new product. But I think I now have a reasonable solution.

Phase 3 will be to look at LS on the GPU! I haven't addressed the problems of parallel regression yet. Very happy to discuss, however, either here or one-to-one.

regards

Mark
Dear Mark,
I did some work on a parallel implementation of LS. The results will be made public after the testing is completed (if everything works well, hopefully).

I have just read the "GRAPHICAL ASIAN OPTIONS" paper and have several questions. In section 5, where the numerical results are discussed, 32768 paths were used to price the option at each volatility level and the target price was computed on the CPU using 2^22 paths. But how was the control variate used? Was the Sobol sequence randomized somehow?

It is surprising to see that with only 32768 paths the QMC results were so accurate. I was under the impression that QMC loses its efficiency at higher dimensions. Could it be because the Asian payoff becomes less volatile as the number of averaging dates increases?

Also, how was the CPU performance measured? Did it use QuantLib's Asian pricing engine?

Regards,
Kakhkhor Abdijalilov.
Dear Kakhkhor,
> It is surprising to see that with only 32768 paths the QMC results were so
> accurate. I was under the impression that QMC loses its efficiency at higher
> dimensions.

The notion that "QMC loses its efficiency at higher dimensions" is incorrect. For the majority of financial problems, QMC (with Sobol' sequences) will be much more efficient than MC (by roughly a factor of sqrt(N), where N is the number of sampled points/paths) regardless of the nominal dimension, because financial problems have low effective dimensions. For details see e.g.
http://www.broda.co.uk/gsa/wilmott_GSA_SK.pdf
http://www.broda.co.uk/gsa/qmc_finance.pdf

Regards,
Sergei Kucherenko
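For context, the textbook rates behind this (standard results, not from Sergei's message): the statistical error of plain MC is of order 1/sqrt(N), while the Koksma-Hlawka bound for QMC is of order (log N)^d / N in nominal dimension d. When the effective dimension is low, the observed QMC error on financial problems behaves close to 1/N, which is where the "roughly sqrt(N)" improvement over MC comes from.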
2010/8/9 Kakhkhor Abdijalilov <[hidden email]>:
> Could it be because the Asian payoff becomes less
> volatile as the number of averaging dates increases?

Correct me if I'm wrong, but an Asian payoff DOES become less volatile as the number of averaging dates increases.

M.
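A concrete way to see this, under the usual lognormal assumptions: for an asset observed at n equally spaced dates, the geometric average is again lognormal with effective volatility sigma_G = sigma * sqrt((n+1)(2n+1)/(6*n^2)), which equals sigma at n = 1 and falls towards sigma/sqrt(3) as n grows. So at least for the geometric proxy (and in practice for the arithmetic average too), the payoff does become less volatile as averaging dates are added.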
To use the control variate, you simply price the arithmetic minus the
geometric, and add the analytic price at the end.

Sobol numbers work well in high dimensions if used properly.

The code for the Asian stuff is in Kooderive. I wrote my own Asian pricer for the CPU.

regards

Mark
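In code the estimator is just the average of the payoff differences plus the closed-form geometric price. A minimal hedged sketch (illustrative names, not the Kooderive or QuantLib pricers), assuming both payoffs are already discounted per path:

    #include <cstddef>
    #include <vector>

    // assumes two non-empty vectors of equal length, one entry per path
    double controlVariatePrice(const std::vector<double>& arithPayoffs,
                               const std::vector<double>& geomPayoffs,
                               double analyticGeometricPrice) {
        double sumDiff = 0.0;
        for (std::size_t i = 0; i < arithPayoffs.size(); ++i)
            sumDiff += arithPayoffs[i] - geomPayoffs[i];
        return sumDiff / arithPayoffs.size() + analyticGeometricPrice;
    }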
Dear Sergei,
Thank you for the very illuminating article. Indeed, after reading some MC books I was under the impression that LD sequences lose their advantage in higher dimensions. I see now that this is not the case.

Btw, does anyone use the Ziggurat method? It generates Gaussian variates directly from uniform integers and avoids using the inverse CDF. In my tests with Asian option pricing engines, Ziggurat ran about 40% faster than the inverse-CDF method. In those tests I used GBM with exact sampling, and thus needed to take an exp or log at every time step. If Euler is used (when exact sampling is not possible), Ziggurat may fare even better (no need to take an exp or log at every time step). In models with simple approximation schemes such as Euler, the overall weight of CDF inversion in the total work could be significant.

The Ziggurat method should be good on a GPU as well, because it needs only one floating-point multiplication to convert a random int into a standard Gaussian variate. This leaves less room for the numerical errors associated with single-precision numbers.

Ziggurat is an acceptance/rejection method, and theoretically it shouldn't be used with low-discrepancy sequences. But the rejection probability is very small, and in about 99% of cases Ziggurat simply multiplies random ints by normalization factors from the lookup table. This way the LD property should be mostly preserved. But empirical tests are needed to see how good it is. I have a C++ implementation of a Ziggurat/Sobol normal variate generator. If anyone wants to try it out, PM me.

I have also implemented a re-entrant multithreaded Sobol generator using the Joe and Kuo direction integers with maximum dimension 21201. Re-entrant means that it can be used simultaneously by many threads. It works with OpenMP as well. The synchronization is done internally using atomic operations and hidden from the user. It is as fast as QuantLib's implementation and has virtually zero synchronization overhead. If anyone wants to try it out, please PM me.

Regards,
Kakhkhor Abdijalilov.
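The synchronization pattern behind such a re-entrant generator can be sketched in a few lines (an illustration of the idea only, not Kakhkhor's code): callers atomically claim a block of sequence indices and then compute those Sobol points directly from the index, so no lock is held while points are generated.

    #include <atomic>
    #include <cstdint>

    class SobolIndexAllocator {
      public:
        SobolIndexAllocator() : next_(1) {}      // index 0 (the all-zeros point) is skipped
        // Claim a contiguous block of n sequence indices; returns the first one.
        // Safe to call concurrently from many threads / OpenMP workers.
        std::uint64_t claim(std::uint64_t n) {
            return next_.fetch_add(n, std::memory_order_relaxed);
        }
      private:
        std::atomic<std::uint64_t> next_;
    };

    // usage sketch inside a worker loop (generatePointFromIndex is hypothetical,
    // e.g. the Gray-code construction sketched earlier in the thread):
    //   const std::uint64_t first = allocator.claim(blockSize);
    //   for (std::uint64_t k = 0; k < blockSize; ++k)
    //       generatePointFromIndex(first + k);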
Dear All,
update on kooql

OK, I've made more progress. The cash-flow generation and discounting are now done on the GPU. I have also got the code to work with 2 GPUs.

Current timings: 1 million paths, 32 rates, 32 steps, 5 factors.

Time to compute the price: 0.7 seconds with 2 GPUs, 1 second with 1 GPU.

Rough time with the QuantLib market-model code: 170 seconds.

So the speed-ups are about 240x and 170x.

The cash-flow generation is templatized on the product, so it is fairly generic.

Hardware is one Quadro FX5800 and one Tesla C1060. (Thank you, NVIDIA!)

regards

Mark
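One way "templatized on the product" can look in CUDA is a small device-side functor plus a kernel instantiated per product type. The sketch below is purely illustrative (hypothetical names and data layout, not the actual Kooderive interface).

    #include <cuda_runtime.h>

    struct CapletStripProduct {
        float strike;
        float accrual;
        // cash-flow generated at a given step from the corresponding forward rate
        __device__ float cashFlow(int step, float forwardRate) const {
            const float intrinsic = forwardRate - strike;
            return intrinsic > 0.0f ? accrual * intrinsic : 0.0f;
        }
    };

    template <class Product>
    __global__ void cashFlowKernel(const float* rates,   // [paths x steps], path-major
                                   float* flows,         // same layout, output
                                   int paths, int steps, Product product) {
        const int path = blockIdx.x * blockDim.x + threadIdx.x;
        if (path >= paths) return;
        for (int s = 0; s < steps; ++s) {
            const int idx = path * steps + s;
            flows[idx] = product.cashFlow(s, rates[idx]);
        }
    }

    // launch sketch:
    //   cashFlowKernel<CapletStripProduct><<<blocks, threads>>>(dRates, dFlows,
    //                                                           paths, steps, product);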