Kooql


Kooql

Mark joshi-2
Dear All,

So I have finally made some progress on integrating CUDA and QuantLib.

In particular, I have developed two classes:

MarketModelEvolverLMMPC_CUDA

and

SobolCudaBrownianGenerator

The first of these does LMM path generation on the GPU in the spot
measure using a predictor-corrector scheme and can be fitted directly
into the MarketModels code.

The second generates a multidimensional Brownian motion using Sobol
numbers and Brownian bridging; the paths can then be used in any application.
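
For readers unfamiliar with the pattern, a rough sketch of how a GPU batch of increments can be served to an otherwise unchanged CPU evolver (names and interface are illustrative only, not the actual SobolCudaBrownianGenerator declaration):

    #include <cstddef>
    #include <vector>

    // The device fills one large batch of Brownian increments (Sobol draws
    // plus Brownian bridge) in a single kernel launch; the batch is copied
    // back once and then served path by path to the CPU evolver.
    class BatchedGpuBrownianGenerator {
      public:
        BatchedGpuBrownianGenerator(std::size_t factors, std::size_t steps,
                                    std::size_t paths)
        : factors_(factors), steps_(steps), paths_(paths),
          hostBuffer_(factors * steps * paths, 0.0f),
          nextPath_(0), path_(0), step_(0) {
            // In the real code the CUDA kernels would run here and the
            // result would be copied into hostBuffer_ in one transfer.
        }

        // Start serving the next stored path; returns false when exhausted.
        bool nextPath() {
            if (nextPath_ == paths_) return false;
            path_ = nextPath_++;
            step_ = 0;
            return true;
        }

        // Copy the increments of the current step for all factors.
        void nextStep(std::vector<float>& variates) {
            const std::size_t base = (path_ * steps_ + step_) * factors_;
            for (std::size_t f = 0; f < factors_; ++f)
                variates[f] = hostBuffer_[base + f];
            ++step_;
        }

      private:
        std::size_t factors_, steps_, paths_;
        std::vector<float> hostBuffer_;   // filled once from the GPU
        std::size_t nextPath_, path_, step_;
    };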

Typical timings to price 32 caplets in a 32-rate, 32-step, 32-factor model
with 524272 paths:

using the CUDA evolver: 32 seconds, of which 7 seconds are spent on path
generation and 6 seconds on transferring paths to the CPU;

using the QL evolver: 135 seconds;

using the QL evolver with the SobolCudaBrownianGenerator: 79 seconds.

The code can be obtained via the Subversion repository at
kooderive.sourceforge.net.

Clearly, much greater speed-ups are possible, but they will involve more
code redesign to avoid transferring complete paths from the GPU to the CPU.

regards

Mark



--


Assoc Prof Mark Joshi
Centre for Actuarial Studies
University of Melbourne
My website is www.markjoshi.com


Re: Kooql

Klaus Spanderen-2
Hi Mark,

I'm also using a CUDA Sobol generator based on Mike Giles's version. IMHO the stride
in lines 94 and 245,

 stride = gridDim.x * blockDim.x;

should be a power of two. blockDim.x is equal to threadsperblock, which is 64.
I'm also ensuring that dimGrid.x is a power of two by setting

        dimGrid.x = 1;
        int n = 1 + 31 / nDimensions;
        while (n >>= 1) dimGrid.x <<= 1;

at line 338ff. (As nDimensions is usually larger than 31 for your problems,
this might not be an issue for your examples.)
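
The same rounding can be written as a small stand-alone helper (an illustration of the idea above, not the actual kernel code); with blockDim.x already a power of two, this keeps the stride one as well:

    #include <cstddef>

    // Largest power of two not exceeding n (for n >= 1), i.e. the value the
    // while-loop above assigns to dimGrid.x; e.g. 5 -> 4, 64 -> 64, 1 -> 1.
    std::size_t largestPowerOfTwoBelow(std::size_t n)
    {
        std::size_t p = 1;
        while (n >>= 1)
            p <<= 1;
        return p;
    }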

cheers
 Klaus


Re: Kooql

Mark joshi-2
Dear Klaus

thanks for your comments. I haven't done much playing with this aspect,
since this is not where my bottlenecks are at the moment.

NB we now have a working skip method in the CUDA Sobol generator, so you can
call it multiple times and get the next set of paths each time. We also
skip the first path of zeros.
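
For anyone implementing the same thing, the standard way to jump straight to an arbitrary index in a Gray-code-ordered Sobol sequence looks roughly like this (a generic sketch, not the Kooderive kernel):

    #include <cstdint>

    // State of one Sobol dimension at index n, computed directly from the
    // Gray code of n rather than by stepping through all earlier draws.
    // 'v' holds the 32 direction integers for that dimension.
    std::uint32_t sobolStateAt(std::uint64_t n, const std::uint32_t v[32])
    {
        std::uint64_t gray = n ^ (n >> 1);         // Gray code of n
        std::uint32_t x = 0;
        for (int k = 0; k < 32; ++k)
            if (gray & (std::uint64_t(1) << k))
                x ^= v[k];                          // XOR in the set directions
        return x;                                   // the point is x * 2^-32
    }

Index 0 gives x = 0, the all-zero point, which is why that first draw is the one being skipped.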

regards

Mark



--


Assoc Prof Mark Joshi
Centre for Actuarial Studies
University of Melbourne
My website is www.markjoshi.com


Re: Kooql

Kakhkhor Abdijalilov
Dear Mark,
Dear QuantLib users,
The Kooderive project looks very interesting. I am curious about the specs of
the GPU/CPU on which the tests were run. What kind of floating-point numbers
were used, double or single?

Currently I am working on a multicore implementation of the LMM too. One
problem with many-factor models is that the Longstaff-Schwartz method
requires a huge amount of memory to store the paths. I am considering using
floats to save space and to get a better speedup with SSE2, but I am not sure
about the numerical error. Do you think that single-precision accuracy is
enough in most cases?
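
To put a number on the memory concern, a back-of-the-envelope calculation at the test size quoted earlier in the thread (524272 paths, 32 steps, 32 rates):

    #include <cstdio>

    int main()
    {
        // One stored value per path, per step, per rate.
        const double values = 524272.0 * 32 * 32;                   // ~5.4e8 values
        std::printf("float  storage: %.1f GB\n", values * 4 / 1e9); // ~2.1 GB
        std::printf("double storage: %.1f GB\n", values * 8 / 1e9); // ~4.3 GB
        return 0;
    }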

How would Longstaff-Schwartz be implemented on a GPU? Is there any
scalable implementation of the regression part?

Also, why does the CUDA evolver need to transfer the paths to the CPU?
Can't the GPU price all the paths and simply return the results?

With the CUDA evolver, 6 seconds were spent transferring paths. Does that
mean that without the transfer the total time could be reduced from
32 to 26 seconds?

Regards,
Kakhkhor Abdijalilov.


Re: Kooql

Mark joshi-2
In reply to this post by Mark joshi-2
Dear Kakhkhor,

all tests were done with floats on the GPU -- these seem to be more
than accurate enough.

I have not yet addressed the issues with LS on the GPU!

The project is incremental. Phase 1 was to do the paths on the GPU and
the rest on the CPU; hence I had to transfer the paths to the CPU. The main
advantage of this approach is that you get all the QuantLib functionality.

Phase 2 will be to price on the GPU and will not require the path transfer.
I am hoping that the total time for this will be less than 10 seconds! I am
about to test the code for this. The main trickiness was in defining the
product in a sufficiently generic way that a huge amount of recoding is not
needed for every new product. But I think I now have a reasonable solution.
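
For concreteness, the shape such a generic product interface might take (invented names; this is only a sketch of the idea, not the Kooderive design):

    #include <cstddef>
    #include <vector>

    // The path loop -- on the GPU, a kernel -- is written once and takes the
    // product as a template parameter, so a new product only needs a small
    // payoff functor rather than a rewrite of the evolver.
    struct CapletStrip {
        float strike, accrual;
        // Value along one path of rates (steps x rates, flattened), assumed
        // already expressed in the chosen numeraire.
        float payoff(const float* pathRates, std::size_t steps,
                     std::size_t rates) const {
            float total = 0.0f;
            for (std::size_t s = 0; s < steps; ++s) {
                float f = pathRates[s * rates + s];   // rate fixing at its reset
                total += accrual * (f > strike ? f - strike : 0.0f);
            }
            return total;
        }
    };

    template <class Product>
    double meanPayoff(const Product& product, const std::vector<float>& allPaths,
                      std::size_t paths, std::size_t steps, std::size_t rates)
    {
        double sum = 0.0;
        for (std::size_t p = 0; p < paths; ++p)
            sum += product.payoff(&allPaths[p * steps * rates], steps, rates);
        return sum / paths;
    }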

Phase 3 will be to look at LS on the GPU! I haven't addressed the problems of
parallel regression yet. I am very happy to discuss it, however, either here
or one-to-one.

regards

Mark







--


Assoc Prof Mark Joshi
Centre for Actuarial Studies
University of Melbourne
My website is www.markjoshi.com


Re: Kooql

Kakhkhor Abdijalilov
Dear Mark,
I did some work on a parallel implementation of LS. The results will be
made public after the testing is completed (if everything works well,
hopefully).

I have just read the "GRAPHICAL ASIAN OPTIONS" paper and have several questions.

In section 5, where the numerical results are discussed, 32768 paths
were used to price the option at each volatility level, and the target
price was computed on the CPU using 2^22 paths. But how was the control
variate used? Was the Sobol sequence randomized somehow?

It is surprising to see that with only 32768 paths the QMC results were so
accurate. I was under the impression that QMC loses its efficiency at
higher dimensions. Could it be because the Asian payoff becomes less
volatile as the number of averaging dates increases? Also, how was the CPU
performance measured? Did it use QuantLib's Asian pricing engine?

Regards,
Kakhkhor Abdijalilov.


Re: Kooql

Kucherenko, Sergei
Dear Kakhkhor,

>It is surprising to see that with only 32768 paths the QMC results were so accurate. I was under the impression that QMC loses its efficiency at higher dimensions.
The notion that "QMC loses its efficiency at higher dimensions" is incorrect. For the majority of financial problems, QMC (with Sobol' sequences) will be much more efficient than MC (by a factor of roughly sqrt(N), where N is the number of sampled points/paths) regardless of the nominal dimension, because financial problems have low effective dimensions. For details see, e.g., http://www.broda.co.uk/gsa/wilmott_GSA_SK.pdf and http://www.broda.co.uk/gsa/qmc_finance.pdf.
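
In terms of the textbook error bounds (standard asymptotics, included only for orientation):

    \varepsilon_{MC} \sim \sigma \, N^{-1/2}, \qquad
    \varepsilon_{QMC} = O\!\left( (\log N)^d \, N^{-1} \right),

so when the effective dimension d is small, the QMC error at the same N is smaller than the MC error by a factor of roughly sqrt(N).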

Regards,
Sergei Kucherenko

Re: Kooql

Marcin Pawlik
In reply to this post by Kakhkhor Abdijalilov
2010/8/9 Kakhkhor Abdijalilov <[hidden email]>:

> Could it be because Asian payoff becomes less
> volatile as the number of averaging dates increases?

Correct me if I'm wrong, but an Asian payoff DOES become less volatile
as the number of averaging dates increases.
M.


Re: Kooql

Mark joshi-2
In reply to this post by Mark joshi-2
To use the control variate, you simply price the arithmetic payoff minus the
geometric payoff, and add the analytic price of the geometric at the end.
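
In code, the estimator is just (hypothetical names; the Kooderive implementation itself is in the repository):

    #include <cstddef>
    #include <vector>

    // Control-variate estimate of the arithmetic Asian price: average the
    // simulated (arithmetic - geometric) payoff differences, then add back
    // the analytic geometric price. Payoffs are assumed already discounted.
    double arithmeticAsianCV(const std::vector<double>& arithmeticPayoffs,
                             const std::vector<double>& geometricPayoffs,
                             double analyticGeometricPrice)
    {
        double sumDiff = 0.0;
        for (std::size_t i = 0; i < arithmeticPayoffs.size(); ++i)
            sumDiff += arithmeticPayoffs[i] - geometricPayoffs[i];
        return sumDiff / arithmeticPayoffs.size() + analyticGeometricPrice;
    }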

Sobol numbers work well in high dimensions if used properly.

The code for the Asian stuff is in Kooderive. I wrote my own Asian
pricer for the CPU.

regards

Mark


--


Assoc Prof Mark Joshi
Centre for Actuarial Studies
University of Melbourne
My website is www.markjoshi.com


Re: Kooql

Kakhkhor Abdijalilov
In reply to this post by Kucherenko, Sergei
Dear Sergei,
Thank you for the very illuminating article. Indeed, after reading
some MC books I was under the impression that LD sequences lose their
advantage in higher dimensions. I see now that this is not the case.

By the way, does anyone use the Ziggurat method? It generates Gaussian
variates directly from uniform integers and avoids using the inverse CDF. In
my tests with Asian option pricing engines, Ziggurat ran about 40% faster
than the inverse-CDF method. In those tests I used GBM with exact sampling,
and thus needed to take an exp or log at every time step. If Euler is used
(when exact sampling is not possible), Ziggurat may fare even better (there
is no need to take an exp or log at every time step). In models with simple
approximation schemes such as Euler, the overall weight of CDF inversion in
the total work could be significant.

The Ziggurat method should be good on a GPU as well, because it needs only
one floating-point multiplication to convert a random int into a standard
Gaussian variate. This leaves less room for the numerical errors associated
with single-precision numbers.

Ziggurat is an acceptance/rejection method, and theoretically it shouldn't
be used with low-discrepancy sequences. But the rejection probability is
very small, and in about 99% of cases Ziggurat simply multiplies random ints
by normalization factors from a lookup table, so the LD property should be
mostly preserved. Empirical tests are needed, though, to see how good it is.
I have a C++ implementation of a Ziggurat/Sobol normal variate generator; if
anyone wants to try it out, PM me.

I have also implemented a re-entrant multithreaded Sobol generator using the
Joe and Kuo direction integers, with maximum dimension 21201. Re-entrant
means that it can be used simultaneously by many threads; it works with
OpenMP as well. The synchronization is done internally using atomic
operations and is hidden from the user. It is as fast as QuantLib's
implementation and has virtually zero synchronization overhead. If anyone
wants to try it out, please PM me.
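
The core of such a re-entrant scheme can be sketched in a few lines (just the standard atomic-counter pattern, not Kakhkhor's actual code, which is not public):

    #include <atomic>
    #include <cstdint>

    // Each thread atomically reserves a block of sequence indices and then
    // generates those Sobol points independently (e.g. by skipping straight
    // to the first index of its block); the only shared state is one counter.
    class ReentrantSobolCounter {
      public:
        explicit ReentrantSobolCounter(std::uint64_t start = 1) : next_(start) {}

        // Reserve 'count' consecutive indices; safe to call from many threads.
        std::uint64_t reserve(std::uint64_t count) {
            return next_.fetch_add(count, std::memory_order_relaxed);
        }

      private:
        std::atomic<std::uint64_t> next_;
    };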

Regards,
Kakhkhor Abdijalilov.


Re: Kooql

Mark joshi-2
In reply to this post by Mark joshi-2
Dear All,

An update on Kooql:

OK, I've made more progress.

The cash-flow generation and discounting are now done on the GPU. I
have also got the code to work with two GPUs. Current timings:

1 million paths, 32 rates, 32 steps, 5 factors.
Time to compute the price:
0.7 seconds with 2 GPUs;
1 second with 1 GPU.

Rough time with the QuantLib market model code: 170 seconds,
so the speed-ups are about 240x and 170x.

The cash-flow generation is templatized on the product, so it is fairly generic.

The hardware is one Quadro FX5800 and one Tesla C1060 (thank you, NVIDIA!).
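
A generic way to split the work across the two cards looks something like this (cudaSetDevice plus one host thread per GPU; sumPayoffsOnDevice is a placeholder, and std::thread stands in for whatever threading library the real code uses):

    #include <cuda_runtime.h>
    #include <cstddef>
    #include <thread>

    // Placeholder for the real work: launch the path-generation, cash-flow and
    // discounting kernels for [firstPath, firstPath + numPaths) on 'device' and
    // return the sum of the discounted payoffs for those paths.
    double sumPayoffsOnDevice(int device, std::size_t firstPath, std::size_t numPaths)
    {
        cudaSetDevice(device);   // bind the calling host thread to this GPU
        // ... kernel launches and reduction would go here ...
        return 0.0;
    }

    // One host thread per GPU, each pricing half of the paths.
    double priceOnTwoGpus(std::size_t totalPaths)
    {
        const std::size_t half = totalPaths / 2;
        double sum0 = 0.0, sum1 = 0.0;
        std::thread t0([&] { sum0 = sumPayoffsOnDevice(0, 0, half); });
        std::thread t1([&] { sum1 = sumPayoffsOnDevice(1, half, totalPaths - half); });
        t0.join();
        t1.join();
        return (sum0 + sum1) / totalPaths;   // Monte Carlo average over all paths
    }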

regards

Mark



--


Assoc Prof Mark Joshi
Centre for Actuarial Studies
University of Melbourne
My website is www.markjoshi.com
