Alea.cuBase 1.2.680 Released

We are happy to announce that Alea.cuBase 1.2.680 has now been released. Alea.cuBase 1.2.x has a brand new design and many new features.

  • Alea.cuBase 1.2.680 is now available in the NuGet Gallery.
  • Alea.cuBase 1.2.680 is also available on our NuGet feed at http://nuget.aleacubase.com/nuget.
  • Alternatively, you can download the assemblies from here.
  • The online manual for 1.2.680 is available here.
  • Samples can be downloaded from here.
  • The release notes can be found here.

To quickly get started, see the online manual and the samples linked above.

Alea.cuBase 1.1 Preview 1 (1.1.467) released!

Alea.cuBase 1.1 Preview 1 (1.1.467) has been released; you can find it in the NuGet Gallery.

This is a preview release that includes some experimental new features:

  • Initial support for structs.
  • Initial support for graphics interop.

There will be examples and tutorials soon to show how to use these new features.

CUDA scripting inside Excel (Part II)

(Tip: to view the video in higher quality, click the gear button on the video player.)

This video shows how to use Alea.cuExtension to quickly code a PI calculation with Monte Carlo simulation inside Excel using Tsunami F# scripting. We use a Sobol sequence because the other parallel random number generators in Alea.cuExtension are still in development.

The testing code:
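
In outline, such a test looks like the following hypothetical sketch. It is not the code from the video: host-generated pseudo-random points stand in for the Alea.cuExtension Sobol generator, and it uses the cuda-monad API from the Getting Started post later in this collection.

open Alea.CUDA

// Hypothetical sketch: mark each point that falls inside the unit quarter circle,
// then estimate PI as 4 * (hits / samples).
let piEstimator = cuda {
    let! kernel =
        <@ fun n (xs:DevicePtr<float>) (ys:DevicePtr<float>) (hits:DevicePtr<float>) ->
            let start = blockIdx.x * blockDim.x + threadIdx.x
            let stride = gridDim.x * blockDim.x
            let mutable i = start
            while i < n do
                hits.[i] <- if xs.[i] * xs.[i] + ys.[i] * ys.[i] <= 1.0 then 1.0 else 0.0
                i <- i + stride @>
        |> defineKernelFunc

    let divUp num den = (num + den - 1) / den

    return PFunc(fun (m:Module) (xs:float[]) (ys:float[]) ->
        let kernel = kernel.Apply m
        let n = xs.Length
        use dxs = m.Worker.Malloc(xs)
        use dys = m.Worker.Malloc(ys)
        use dhits = m.Worker.Malloc<float>(n)
        let lp = LaunchParam(min (divUp n 1024) 16, 1024)
        kernel.Launch lp n dxs.Ptr dys.Ptr dhits.Ptr
        4.0 * Array.sum (dhits.ToHost()) / float n) }

let worker = Engine.workers.DefaultWorker
use pim = worker.LoadPModule(piEstimator)

// Host-generated points in [0,1) x [0,1); the actual sample uses a Sobol sequence instead.
let rng = System.Random(42)
let n = 1 <<< 20
let xs = Array.init n (fun _ -> rng.NextDouble())
let ys = Array.init n (fun _ -> rng.NextDouble())
printfn "pi is approximately %f" (pim.Invoke xs ys)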

Alea.cuBase now available on NuGet Gallery

Alea.cuBase makes it very easy to quickly start coding CUDA kernels. It does not need an installation of the NVIDIA CUDA toolkit. All you need to start is a CUDA Fermi or later device and the CUDA 5.0 drivers installed.

There are two ways to install Alea.cuBase:

  1. Go to the downloads section, download the .msi installer file, and install it.
  2. Or just use the NuGet utility, which I will show you in this post.

Alea.cuBase is available in the NuGet Gallery. You can easily install it with the NuGet package manager integrated in Visual Studio.

First, if you do not have NuGet installed, you can get it from here. Note that it is installed by default in Visual Studio 2012, but you might need to upgrade it to the newest version.

Start Visual Studio, create an F# application project, right-click the “References” item to open the context menu, and choose “Manage NuGet Packages…”:

[Screenshot: “Manage NuGet Packages…” in the References context menu]

In the package management window, you can search for “alea.cubase”:

[Screenshot: searching for “alea.cubase” in the package manager]

Or search for “cuda”:

[Screenshot: searching for “cuda” in the package manager]

Then install Alea.cuBase with one click on the “Install” button. During installation you must agree to the EULA to continue.
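
If you prefer the Package Manager Console to the dialog, the equivalent command (assuming the package id is Alea.cuBase) should be:

PM> Install-Package Alea.cuBase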

After installation, two assemblies will be added to your project automatically:

[Screenshot: the assembly references added to the project]

Now you can code a simple kernel to test it:

[Screenshot: a simple test kernel in the F# project]
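
As a hypothetical sketch of such a test (following the cuda-monad API used in the Getting Started post later in this collection, not necessarily the exact code shown in the screenshot):

open Alea.CUDA

// A tiny element-wise add, launched with one block of A.Length threads.
let addTwo = cuda {
    let! kernel =
        <@ fun (C:DevicePtr<float>) (A:DevicePtr<float>) (B:DevicePtr<float>) ->
            let tid = threadIdx.x
            C.[tid] <- A.[tid] + B.[tid] @>
        |> defineKernelFunc

    return PFunc(fun (m:Module) (A:float[]) (B:float[]) ->
        let n = A.Length
        use dA = m.Worker.Malloc(A)
        use dB = m.Worker.Malloc(B)
        use dC = m.Worker.Malloc<float>(n)
        let lp = LaunchParam(1, n)
        kernel.Launch m lp dC.Ptr dA.Ptr dB.Ptr
        dC.ToHost()) }

let worker = Engine.workers.DefaultWorker
use addm = worker.LoadPModule(addTwo)
printfn "%A" (addm.Invoke [| 1.0; 2.0 |] [| 10.0; 20.0 |])   // expected: [|11.0; 22.0|]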

That’s it: your development environment is now ready for CUDA kernel coding!

Some additional points are worth noting:

  • Without a license installed, it can only work in runtime or evaluation mode. The NuGet package also installs a License Manager in the package tools directory. You can find it in your solution’s packages directory and launch it to install a license if you have purchased one.
  • The package ships with two assemblies: “Alea.CUDA.dll” and “Alea.Interop.dll”, which are required to run your application and must be included for redistribution.
  • The package does not include the manual or the API reference. To get them, please go to the downloads section on our web page and follow the links there to download the required documentation.

Alea.cuBase 1.0.401 released!

Alea.cuBase 1.0.401 has been released; you can find it here.

Alternatively you can get it as a NuGet package.

In this release we change the product name from Alea.CUDA to Alea.cuBase in order to conform to existing trademarks. The new name also reflects the fact that Alea.cuBase is a base technology on which you can build your own GPU-accelerated .NET applications.

The new release also improves the kernel launch time.

We also moved the tutorial and examples to the public GitHub repository https://github.com/quantalea/Alea.cuSamples.

Install Alea.cuBase license on EC2 machine

Alea.cuBase licenses rely on a fingerprint of the running hardware. On an Amazon EC2 cloud machine the hardware keeps changing, so in order to install an Alea.cuBase license you have to create a Virtual Private Cloud (VPC) and use an Elastic Network Interface (ENI) in that VPC, which retains a static MAC address. In this article I will show you how to set up a VPC and start an EC2 instance in that VPC with an ENI attached. With this setup we get a stable fingerprint and you can install an Alea.cuBase license.

STEP 1: Create VPC

Log in to your AWS console and click “VPC” to go to the VPC dashboard. If no VPC has been created yet, the dashboard shows a “Get started creating a VPC” button. Click it and a VPC configuration selection window pops up:

[Screenshot: VPC configuration selection]

It offers four typical network topologies. In this example I choose the first and simplest one, which exposes the whole network to the outside world.

We then click “Continue” and some network settings are shown. The only thing we need to change here is the availability zone of the subnet, because some zones do not offer GPU instances. In this example I choose “us-east-1a”:

[Screenshot: subnet settings with availability zone “us-east-1a”]

We click “Create VPC”. The system then creates a set of objects: 1 VPC, 1 subnet, 1 internet gateway, 1 network ACL, 2 route tables, and 1 security group. In this example I keep all default settings except for the security group.

The security group needs to allow Remote Desktop Protocol (RDP) so that I can connect to the instance. Navigate to “Security Groups” in the VPC console, add one inbound rule for RDP (you can select the RDP protocol from the dropdown list), then apply the changes:

[Screenshot: inbound RDP rule in the security group]

That finishes the VPC setup; we can go back to the main console.

STEP 2: Allocate EIP

In order to connect to the EC2 machine from outside, we need to allocate an Elastic IP (EIP). Go back to the main console and select “EC2”, which brings us to the EC2 console. On the left, navigate to “Elastic IPs” and click “Allocate New Address”. A dialog pops up asking for the purpose of this EIP; set it to be used in a VPC:

[Screenshot: “Allocate New Address” dialog]

STEP 3: Create ENI

An EIP is just a resource; we also need to create an ENI and attach the EIP to it. On the left, navigate to “Network Interfaces” and click “Create Network Interface”. In the creation dialog, select the subnet that was created with the VPC:

[Screenshot: “Create Network Interface” dialog]

The ENI now exists, but it only has a private IP address, so we need to attach the EIP to it. Right-click the ENI we just created and choose “Associate Address” from the context menu. A dialog lets you select which IP address to associate; select the EIP we just allocated:

[Screenshot: “Associate Address” dialog]

STEP 4: Start Instance

With all necessary objects in place, we can launch the EC2 instance. Navigate to “Instances”, click “Launch Instance”, choose “Classic Wizard”, and select a proper AMI:

[Screenshot: AMI selection in the Classic Wizard]

On the next page, the instance type should be “CG1 Cluster GPU” in order to get a Tesla GPU, and we need to set it to launch in the VPC we created:

[Screenshot: instance type “CG1 Cluster GPU” launching into the VPC]

On the next page, select the network interface we just created:

[Screenshot: network interface selection]

We also need to select the security group we created earlier:

[Screenshot: security group selection]

After you click “Launch”, it will take 10 to 30 minutes to start the machine.

STEP 5: Install license and test

Now you can use RDP to connect to the instance. We install Alea.cuBase and use the License Manager to install a license:

[Screenshot: License Manager]

It is strongly recommended that you save your license once it is authenticated.

We can run a script from the Examples to verify that it works:

[Screenshot: running an example script on the instance]
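
If the Examples are not at hand, a minimal sanity check like the following hypothetical F# script (not one of the shipped examples, which remain the proper verification) at least confirms that the GPU is visible from .NET on the instance:

open Alea.CUDA

// Query the default CUDA worker and print the device it is bound to.
let worker = Engine.workers.DefaultWorker
printfn "GPU: %s" worker.Name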

If no evaluation window pops up, the license is installed correctly. The license is bound to the ENI, so you need to keep the ENI as long as you use the license. If you change the hardware configuration, you have to go through the re-authentication workflow.

How to use Alea.cuBase in Python

Introduction

Python is often used for scripting and rapid prototyping. In this post we illustrate how we can integrate Alea.cuBase and Python so that we can call GPU algorithms coded with Alea.cuBase conveniently in Python.

In this post we rely on Python for .NET. It provides a nearly seamless integration of Python with the .NET Common Language Runtime (CLR). Note that it does not implement Python as a first-class CLR language, nor does it translate Python code to managed IL code. Rather, it integrates the C Python engine with the .NET runtime.

An alternative approach would be to use IronPython, which is an implementation of the Python programming language targeting the .NET Framework, written entirely in C#. However, because IronPython has some limitations in using very useful Python libraries such as matplotlib, we prefer to work with C Python and Python for .NET.

Setting up the Environment

We suggest that you install the Python tools for Visual Studio from http://pytools.codeplex.com which turn Visual Studio into a nice Python IDE, supporting both CPython and IronPython.

IronPython

If you are going to use IronPython, all that is needed is to install IronPython from http://www.ironpython.net.

Python for .NET

Python for .NET consists of two components:

  • clr.pyd, a Python module interfacing with the .NET world
  • Python.Runtime.dll, an assembly used by clr.pyd

We need to compile Python for .NET against the .NET 4.0 framework and the proper Python version. Currently, Python for .NET supports Python versions 2.3 to 2.7. Check out the source of Python for .NET from https://pythonnet.svn.sourceforge.net/svnroot/pythonnet/trunk.

It contains a solution file for VS 2008. Open it with VS 2010; the conversion succeeds without errors. To compile Python for .NET for Python 2.7 and .NET 4.0, the following steps are required:

Right-click the project “Python.Runtime” and select “Properties”, open the “Application” tab, and change the “Target framework” to “.NET Framework 4”. Then open the file pythonnet\pythonnet\src\runtime\buildclrmodule.bat and change the following command:

%windir%\Microsoft.NET\Framework\v2.0.50727\ilasm

to

%windir%\Microsoft.NET\Framework\v4.0.30319\ilasm

Note that it appears twice. Next, open the file clrmodule.il and change the version number in the following piece of code:

.assembly extern mscorlib
{
     .publickeytoken = (B7 7A 5C 56 19 34 E0 89 )
     .ver 2:0:0:0
}

to

.assembly extern mscorlib
{
     .publickeytoken = (B7 7A 5C 56 19 34 E0 89 )
     .ver 4:0:0:0
}

To change the Python interpreter version, right-click on project "Python.Runtime" and select "Properties". In the "Build" tab, "Conditional compilation symbols", change "PYTHON26" to "PYTHON27" to select the Python 2.7 interpreter.

The last step is to patch methodbinder.cs. Replace the method MatchParameters with the following code:

private static bool _RetrieveGenericArguments(List<Type> gts, Type pt, Type it)
{
    bool ok = true;
    if (pt.GUID == new Guid())
    {
        gts.Add(it);
    }
    else if (pt.IsGenericType && it.IsGenericType && it.GetGenericTypeDefinition().GUID == pt.GUID)
    {
        var pts = pt.GetGenericArguments();
        var its = it.GetGenericArguments();
        for (int i = 0; i < pts.Length; ++i)
        {
            ok &= _RetrieveGenericArguments(gts, pts[i], its[i]);
        }
    }
    else if (!pt.IsGenericType && !it.IsGenericType && pt.GUID == it.GUID)
    {
        // nothing
    }
    else
    {
        ok = false;
    }
    return ok;
}

internal static MethodInfo MatchParameters(MethodInfo[] mis, Type[] its)
{
    foreach (var mi in mis)
    {
        if (!mi.IsGenericMethodDefinition) continue;

        var pts = (from p in mi.GetParameters() select p.ParameterType).ToArray();
        if (pts.Length != its.Length) continue;

        var n = pts.Length;
        var gts = new List<Type>();
        bool ok = true;
        for (int i = 0; i < n; ++i)
        {
            ok &= _RetrieveGenericArguments(gts, pts[i], its[i]);
        }
        if (!ok) continue;
        if (gts.Count != mi.GetGenericArguments().Length) continue;
        return mi.MakeGenericMethod(gts.ToArray());
    }

    return null;
}

Now recompile the project "Python.Runtime".

After a successful build you can test it with the following simple Python script:

import sys

sys.path.append("C:\\dev\\pythonnet\\pythonnet\\src\\runtime\\bin\\Release")

import clr, System

print System.Environment.Version  # for example, print the CLR version to confirm .NET interop works

# you can also print out the sys.path
print '-----'
for p in sys.path:
    print p
print '-----'

Note that the path C:\dev\pythonnet\pythonnet\src\runtime\bin\Release has to point to the location of the module clr.pyd and the assembly Python.Runtime.dll.

Interfacing Python and .NET

In order to use a private assembly, use the clr.AddReference() function. For example, to use the assembly "Test.dll", call clr.AddReference("Test") to load it.

We refer to http://ironpython.net/documentation/dotnet/dotnet.html for how to interoperate with .NET from Python.

Preparing a .NET Assembly with GPU Code

We create an F# library project referencing Alea.CUDA. Make sure that you set the "Copy Local" property of the Alea.CUDA assembly reference to true. The example below provides a simple kernel adding two arrays and a helper class DeviceWorkerHelper, which exposes some module load functions to get around limitations of Python for .NET with class extension methods.

module Lib.Test

open Alea.CUDA

let a = [| 1.0; 2.0 |]

let pfunct = cuda {
    let! kernel =
        <@ fun (C:DevicePtr<float>) (A:DevicePtr<float>) (B:DevicePtr<float>) ->
            let tid = threadIdx.x
            C.[tid] <- A.[tid] + B.[tid] @>
        |> defineKernelFunc

    return PFunc(fun (m:Module) (A:float[]) (B:float[]) ->
        let n = A.Length
        use A = m.Worker.Malloc(A)
        use B = m.Worker.Malloc(B)
        use C = m.Worker.Malloc(n)
        let lp = LaunchParam(1, n)
        kernel.Launch m lp C.Ptr A.Ptr B.Ptr
        C.ToHost()) }

type DeviceWorkerHelper(worker:DeviceWorker) =
    member this.LoadPModule(f:PFunc<'T>, m:Builder.PTXModule) = worker.LoadPModule(f, m)
    member this.LoadPModule(fm:PFunc<'T> * Builder.PTXModule) = worker.LoadPModule(fm)
    member this.LoadPModule(f:PFunc<'T>, m:Builder.IRModule) = worker.LoadPModule(f, m)
    member this.LoadPModule(fm:PFunc<'T> * Builder.IRModule) = worker.LoadPModule(fm)
    member this.LoadPModule(t:PTemplate<PFunc<'T>>) = worker.LoadPModule(t)

Calling a GPU Kernel from Python

The following Python script shows how to call the kernel from the Test assembly:

import sys
import clr, System

sys.path.append("C:\\dev\\pythonnet\\pythonnet\\src\\runtime\\bin\\Release")
sys.path.append("..\\Lib\\bin\\Release")

clr.AddReference("Alea.CUDA")
clr.AddReference("Lib")

from Alea.CUDA import Engine, Framework
from Lib import Test

worker = Engine.workers.DefaultWorker
print worker.Name
worker = Test.DeviceWorkerHelper(worker)

A = System.Array[System.Double]([1.0, 2.0, 3.0, 4.0])
B = System.Array[System.Double]([1.5, 2.5, 3.5, 4.5])

def test(pfuncm):
    C = pfuncm.Invoke.Invoke(A).Invoke(B)
    for x in C: print x,
    pfuncm.Dispose()
    print ""

print "Loading into worker"
pfuncm = worker.LoadPModule(Test.pfunct)

print "Invoking GPU kernel"
test(pfuncm)

Executing the script produces the following output:

C:\dev\PythonInterop\PythonScript>Example.py
[0|3.0|GeForce GT 650M|4]
Loading into worker
Invoking GPU kernel
2.5 4.5 6.5 8.5

Unfortunately this script cannot be executed in the Python Interactive inside Visual Studio, because the Python REPL process exits with a StackOverflowException at the import of Alea.CUDA.

Conclusion

We have shown how to use Alea.cuBase in Python with a suitable modification of Python for .NET. You can download the example project from here.

If you just want to do rapid prototyping together with some simple plotting and visualisation we suggest that you also take a look at the F# interactive and the FSharpChart library.

Getting Started with Alea.cuBase

In this quick start example, we will code a GPU kernel, which calculates the element-wise sum of two arrays x, y of equal length and puts the result into a third array z.

We set up a grid of up to 16 blocks, each with 1024 threads. To cover the whole array, each thread handles multiple array elements.

First we use the cuda monad to write a parallel template, which defines all the required GPU resources and a function describing how these GPU resources are used:

let pfunct = cuda {
    let! kernel =
        <@ fun n (x:DevicePtr<float>)
                 (y:DevicePtr<float>)
                 (z:DevicePtr<float>) ->
            let start = blockIdx.x * blockDim.x + threadIdx.x
            let stride = gridDim.x * blockDim.x
            let mutable i = start
            while i < n do
                z.[i] <- x.[i] + y.[i]
                i <- i + stride @>
        |> defineKernelFunc

    let divUp num den = (num + den - 1) / den

    return PFunc(fun (m:Module) (x:float[]) (y:float[]) ->
        let kernel = kernel.Apply m
        let n = x.Length
        use dx = m.Worker.Malloc(x)
        use dy = m.Worker.Malloc(y)
        use dz = m.Worker.Malloc<float>(n)
        let lp = LaunchParam(min (divUp n 1024) 16, 1024)
        kernel.Launch lp n dx.Ptr dy.Ptr dz.Ptr
        dz.ToHost()) }

The cuda monad produces a value of type:

val pfunct : PTemplate<PFunc<(float [] -> float [] -> float [])>>

Next we need to compile the template. We first generate a worker, which essentially is a CUDA context for a physical GPU, by calling:

let worker = new DeviceWorker(Device(0))

It remains to load the template into the worker, which compiles the CUDA resources into a binary module and loads it into the worker:

use pfuncm = worker.LoadPModule(pfunct)

which produces a value of type:

val pfuncm : PModule<(float [] -> float [] -> float [])>

To execute the kernel we write a small test function:

let test n =
    // Test data and reference solution.
    let x = Array.init n (fun i -> float(i))
    let y = Array.init n (fun i -> float(2*i)) 
    let h = Array.map2 (+) x y

    // Run gpu kernel.
    let d = pfuncm.Invoke x y

    Array.forall2 (fun h d -> h = d) h d

test (1 <<< 24)

It is good practice to clean up your resources:

pfuncm.Dispose()
worker.Dispose()

You have successfully launched your first GPU kernel with Alea.cuBase.