Documentation ¶
Overview ¶
Package cuda provides bindings to the CUDA library.
Building ¶
To use this package, you must tell Go how to link with CUDA. On Mac OS X, this might look like:
export CUDA_PATH="/Developer/NVIDIA/CUDA-8.0"
export DYLD_LIBRARY_PATH="$CUDA_PATH/lib":$DYLD_LIBRARY_PATH
export CPATH="$CUDA_PATH/include/"
export CGO_LDFLAGS="/usr/local/cuda/lib/libcuda.dylib $CUDA_PATH/lib/libcudart.dylib $CUDA_PATH/lib/libcublas.dylib $CUDA_PATH/lib/libcurand.dylib"
On Linux, this might look like:
export CUDA_PATH=/usr/local/cuda
export CPATH="$CUDA_PATH/include/"
export CGO_LDFLAGS="$CUDA_PATH/lib64/libcublas.so $CUDA_PATH/lib64/libcudart.so $CUDA_PATH/lib64/stubs/libcuda.so $CUDA_PATH/lib64/libcurand.so"
export LD_LIBRARY_PATH=$CUDA_PATH/lib64/
Contexts ¶
Virtually every cuda API must be run from within a Context, which can be created like so:
devices, err := cuda.AllDevices()
if err != nil {
	// Handle error.
}
if len(devices) == 0 {
	// No devices found.
}
ctx, err := cuda.NewContext(devices[0], 10)
if err != nil {
	// Handle error.
}
To run code in a Context asynchronously, you can do the following:
ctx.Run(func() error {
	// My code here.
	return nil
})
To run code synchronously, simply read from the resulting channel:
<-ctx.Run(func() error {
	// My code here.
	return nil
})
You should never call ctx.Run() inside another call to ctx.Run(), for reasons that are documented on the Context.Run() method.
Memory Management ¶
There are two ways to deal with memory: using Buffers, or using an Allocator directly with unsafe.Pointers. The Buffer API provides a high-level buffer interface with garbage collection and bounds checking. Most APIs use Buffers, including the APIs provided by sub-packages.
No matter what, you will need an Allocator if you want to allocate memory. You can create an Allocator directly on top of CUDA:
allocator := cuda.GCAllocator(cuda.NativeAllocator(ctx), 0)
Once you have an allocator, you can use it to allocate Buffer objects like so:
err := <-ctx.Run(func() error {
	// Allocate 16 bytes.
	buffer, err := cuda.AllocBuffer(allocator, 16)
	if err != nil {
		return err
	}
	// Use the buffer here...
	return nil
})
There are various functions to help you deal with buffers. The WriteBuffer() and ReadBuffer() functions allow you to copy Go slices to and from buffers. The Slice() function allows you to get a Buffer which points to a sub-region of a parent Buffer.
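Buffer sizes are given in bytes, so it helps to know how many bytes a Go slice occupies before allocating. The following is a minimal sketch using only the standard library; `byteSize` is a hypothetical helper and not part of this package.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// byteSize reports how many bytes a slice of fixed-size elements occupies.
// It returns -1 if the element type does not have a fixed size.
// This is a hypothetical helper, not part of the cuda package.
func byteSize(slice interface{}) int {
	return binary.Size(slice)
}

func main() {
	fmt.Println(byteSize(make([]float32, 4))) // prints "16"
	fmt.Println(byteSize(make([]float64, 4))) // prints "32"
}
```

Under this arithmetic, the 16-byte allocation shown earlier would hold four float32 values or two float64 values.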
Kernels ¶
To run kernels, you will use a Module. You can pass various Go primitives, unsafe.Pointers, and Buffers as kernel arguments.
Sub-packages ¶
The cublas and curand sub-packages provide basic linear algebra routines and random number generators, respectively.
Index ¶
- func ClearBuffer(b Buffer) error
- func CopyBuffer(dst, src Buffer) error
- func MemInfo() (free, total uint64, err error)
- func Overlap(b1, b2 Buffer) bool
- func ReadBuffer(val interface{}, b Buffer) error
- func Synchronize() error
- func WriteBuffer(b Buffer, val interface{}) error
- type Allocator
- type Buffer
- type Context
- type DevAttr
- type Device
- type Error
- type Module
- type Stream
Functions ¶
func ClearBuffer ¶
func ClearBuffer(b Buffer) error
ClearBuffer writes zeros over the contents of a Buffer. It must be called from the correct Context.
func CopyBuffer ¶
func CopyBuffer(dst, src Buffer) error
CopyBuffer copies as many bytes as possible from src into dst.
The two Buffers must not contain overlapping regions of memory.
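The overlap condition can be stated precisely for two byte ranges. The sketch below is an illustration only: `rangesOverlap` is a hypothetical helper working on raw addresses and sizes, and it is not claimed to match how the package's own Overlap function is implemented.

```go
package main

import "fmt"

// rangesOverlap reports whether the half-open byte ranges
// [a, a+aSize) and [b, b+bSize) share at least one byte.
// It is a hypothetical helper, not part of the cuda package.
func rangesOverlap(a, aSize, b, bSize uintptr) bool {
	if aSize == 0 || bSize == 0 {
		return false // an empty range cannot overlap anything
	}
	return a < b+bSize && b < a+aSize
}

func main() {
	fmt.Println(rangesOverlap(0, 8, 8, 8)) // adjacent ranges: prints "false"
	fmt.Println(rangesOverlap(0, 8, 4, 8)) // bytes 4 through 7 are shared: prints "true"
}
```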
func MemInfo ¶
func MemInfo() (free, total uint64, err error)
MemInfo gets the free and total amount of memory available for allocation on the current device.
This must be called in a Context.
func ReadBuffer ¶
func ReadBuffer(val interface{}, b Buffer) error
ReadBuffer reads the data from a Buffer into a slice. This must be called from the correct Context.
See WriteBuffer for details on supported slice types.
func Synchronize ¶
func Synchronize() error
Synchronize waits for asynchronous operations to complete.
This should be called in a Context.
func WriteBuffer ¶
func WriteBuffer(b Buffer, val interface{}) error
WriteBuffer writes the data from a slice into a Buffer. It must be called from the correct Context.
Supported slice types are:
[]byte
[]float64
[]float32
[]int32
[]uint32
Similar to the copy() built-in, the maximum possible amount of data will be copied.
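For readers less familiar with the built-in, this short sketch shows the copy() behavior being referred to: the number of elements copied is the smaller of the two lengths. It uses ordinary host slices only; `copyPrefix` is a hypothetical wrapper added so the result is easy to inspect.

```go
package main

import "fmt"

// copyPrefix copies as many elements as fit from src into dst and
// returns the number of elements copied, exactly like the built-in copy.
func copyPrefix(dst, src []float32) int {
	return copy(dst, src)
}

func main() {
	src := []float32{1, 2, 3, 4, 5}
	dst := make([]float32, 3)
	n := copyPrefix(dst, src)
	fmt.Println(n, dst) // prints "3 [1 2 3]"
}
```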
Types ¶
type Allocator ¶
type Allocator interface {
	// Get the Context in which all calls to this Allocator
	// should be made.
	//
	// Unlike Alloc and Free, this needn't be called from the
	// allocator's Context.
	Context() *Context

	// Allocate a chunk of CUDA memory.
	//
	// This should only be called from the Context.
	Alloc(size uintptr) (unsafe.Pointer, error)

	// Free a chunk of CUDA memory.
	//
	// The size passed to Free must be the same size that was
	// passed to Alloc().
	//
	// This should only be called from the Context.
	Free(ptr unsafe.Pointer, size uintptr)
}
An Allocator allocates and frees CUDA memory.
In general, Allocators are bound to a Context, meaning that they should only be used from within that Context.
Usually, you should prefer to use the Buffer type over a direct memory allocation, since Buffers take care of garbage collection for you.
Allocators are not responsible for zeroing out returned memory.
func BFCAllocator ¶
BFCAllocator creates an Allocator that uses memory coalescing and best-fitting to reduce memory fragmentation.
You should wrap the returned allocator with GCAllocator if you plan to use the Buffer API.
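As a rough illustration of the best-fit idea, the sketch below picks the smallest free chunk that is still large enough for a request. It covers only chunk selection, not coalescing, and `bestFit` is a hypothetical function that is not taken from the package's implementation.

```go
package main

import "fmt"

// bestFit returns the index of the smallest free chunk that can hold
// size bytes, or -1 if no chunk is large enough.
// It is a hypothetical illustration of best-fit selection.
func bestFit(freeSizes []uintptr, size uintptr) int {
	best := -1
	for i, s := range freeSizes {
		if s >= size && (best == -1 || s < freeSizes[best]) {
			best = i
		}
	}
	return best
}

func main() {
	free := []uintptr{64, 16, 32}
	fmt.Println(bestFit(free, 20))  // the 32-byte chunk wins: prints "2"
	fmt.Println(bestFit(free, 128)) // nothing fits: prints "-1"
}
```

Choosing the tightest fit leaves larger chunks intact for later, larger requests, which is the usual argument for best-fit over first-fit.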
The maxSize argument specifies the maximum amount of memory to claim for the allocator. If it is 0, the allocator may claim nearly all of the available device memory.
If the CUDA_BFC_HEADROOM environment variable is set, it is used as the minimum number of bytes to leave free.
If the CUDA_BFC_MAX environment variable is set, it is used as an upper memory bound (in addition to maxSize).
This should be called from a Context.
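One plausible reading of how maxSize, CUDA_BFC_HEADROOM, and CUDA_BFC_MAX combine is sketched below. The function `claimLimit` and the convention that 0 means "unset" for every bound are assumptions made for this example; the package's actual arithmetic may differ.

```go
package main

import "fmt"

// claimLimit returns the number of bytes an allocator could claim given
// the free device memory, a maxSize argument, a headroom to leave free,
// and an environment-supplied upper bound. A value of 0 for maxSize or
// envMax means "no bound" in this sketch (an assumption).
func claimLimit(free, maxSize, headroom, envMax uint64) uint64 {
	var limit uint64
	if free > headroom {
		limit = free - headroom
	}
	if maxSize != 0 && maxSize < limit {
		limit = maxSize
	}
	if envMax != 0 && envMax < limit {
		limit = envMax
	}
	return limit
}

func main() {
	// 8 GiB free, 1 GiB headroom, no other bounds; result shown in MiB.
	fmt.Println(claimLimit(8<<30, 0, 1<<30, 0) >> 20) // prints "7168"
	// Same device, but maxSize is 2 GiB and the environment caps at 1 GiB.
	fmt.Println(claimLimit(8<<30, 2<<30, 1<<30, 1<<30) >> 20) // prints "1024"
}
```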
func GCAllocator ¶
GCAllocator wraps an Allocator in a new Allocator which automatically triggers garbage collections.
The frac argument behaves similarly to the GOGC environment variable, except that GOGC is a percentage whereas frac is a ratio. Thus, a frac of 1.0 is equivalent to GOGC=100. If frac is 0, the value for GOGC is used.
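The relationship between frac and GOGC can be written as a small conversion. This is a sketch under stated assumptions: `effectiveFrac` is a hypothetical helper, and falling back to 1.0 when GOGC is unset or unparseable is a choice made here (based on Go's default of GOGC=100), not documented package behavior.

```go
package main

import (
	"fmt"
	"strconv"
)

// effectiveFrac returns frac if it is non-zero; otherwise it converts the
// GOGC percentage string to a ratio. An unset or unparseable GOGC falls
// back to 1.0, matching Go's default of GOGC=100 (an assumption for this
// sketch, not documented package behavior).
func effectiveFrac(frac float64, gogc string) float64 {
	if frac != 0 {
		return frac
	}
	pct, err := strconv.ParseFloat(gogc, 64)
	if err != nil {
		return 1.0
	}
	return pct / 100
}

func main() {
	fmt.Println(effectiveFrac(0.5, "200")) // explicit frac wins: prints "0.5"
	fmt.Println(effectiveFrac(0, "200"))   // GOGC=200 becomes a ratio: prints "2"
	fmt.Println(effectiveFrac(0, ""))      // default GOGC=100: prints "1"
}
```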
If you are implementing your own Allocator, you will likely want to wrap it with GCAllocator so that it works nicely with the Buffer API.
This need not be called in a Context.
func NativeAllocator ¶
NativeAllocator returns an Allocator that allocates directly from the CUDA APIs.
The resulting Allocator should be wrapped with GCAllocator if you plan to use it with the Buffer API.
This need not be called in a Context.
type Buffer ¶
type Buffer interface {
	// Allocator is the Allocator from which the Buffer was
	// allocated.
	Allocator() Allocator

	// Size is the size of the Buffer.
	Size() uintptr

	// WithPtr runs f with the pointer contained inside the
	// Buffer.
	// During the call to f, it is guaranteed that the Buffer
	// will not be garbage collected.
	// However, nothing should store a reference to ptr after
	// f has completed.
	WithPtr(f func(ptr unsafe.Pointer))
}
A Buffer provides a high-level interface into an underlying CUDA buffer.
func AllocBuffer ¶
AllocBuffer allocates a new Buffer.
This must be called in the Allocator's Context.
This does not zero out the returned memory. To do that, you should use ClearBuffer().
func Slice ¶
Slice creates a Buffer which views some part of the contents of another Buffer. The start and end indexes are inclusive and exclusive, respectively.
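The inclusive-start, exclusive-end convention is the same one Go uses for slice expressions, so a view from start to end covers end minus start bytes. The sketch below spells out that arithmetic; `subRegion` is a hypothetical helper, and returning an error on bad bounds is an assumption (the package may handle invalid bounds differently).

```go
package main

import (
	"errors"
	"fmt"
)

// subRegion computes the byte offset and size of the view [start, end)
// inside a parent region of parentSize bytes.
// It is a hypothetical helper; its error handling is an assumption.
func subRegion(parentSize, start, end uintptr) (offset, size uintptr, err error) {
	if start > end || end > parentSize {
		return 0, 0, errors.New("slice bounds out of range")
	}
	return start, end - start, nil
}

func main() {
	offset, size, err := subRegion(16, 4, 12)
	fmt.Println(offset, size, err) // prints "4 8 <nil>"
}
```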
func WrapPointer ¶
WrapPointer wraps a pointer in a Buffer. You must specify the Allocator from which the pointer originated and the size of the buffer.
After calling this, you should not use the pointer outside of the buffer. The Buffer will automatically free the pointer.
type Context ¶
type Context struct {
// contains filtered or unexported fields
}
A Context maintains a CUDA-dedicated thread. All CUDA code should be run by a Context.
func NewContext ¶
NewContext creates a new Context on the Device.
The bufferSize is the maximum number of asynchronous calls that can be queued up at once. A larger buffer size means that Run() is less likely to block, all else equal.
If bufferSize is -1, then the CUDA_CTX_BUFFER environment variable is used. If bufferSize is -1 and CUDA_CTX_BUFFER is not set, a reasonable default is used.
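The fallback order described above can be sketched as follows. Both `resolveBufferSize` and `defaultBufferSize` are hypothetical: the value 128 is a placeholder, since the package's real default is not stated here, and ignoring malformed or negative environment values is likewise an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// defaultBufferSize is a placeholder value for this sketch only.
const defaultBufferSize = 128

// resolveBufferSize applies the documented order: an explicit size wins,
// then the environment value, then a default. Malformed or negative
// environment values are ignored (an assumption).
func resolveBufferSize(bufferSize int, env string) int {
	if bufferSize != -1 {
		return bufferSize
	}
	if n, err := strconv.Atoi(env); err == nil && n >= 0 {
		return n
	}
	return defaultBufferSize
}

func main() {
	fmt.Println(resolveBufferSize(10, os.Getenv("CUDA_CTX_BUFFER"))) // prints "10"
	fmt.Println(resolveBufferSize(-1, "32"))                         // prints "32"
	fmt.Println(resolveBufferSize(-1, ""))                           // prints "128"
}
```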
func (*Context) Run ¶
Run runs f in the Context and returns a channel that will be sent the result of f when f completes.
This may block until some queued up functions have finished running on the Context.
If you are not interested in the result of f, you can simply ignore the returned channel.
While f is running, no other function can run on the Context. This means that, to avoid deadlock, f should not use the Context.
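To see why a nested call cannot make progress, it may help to picture the Context as a single goroutine draining a bounded queue. The sketch below is a standard-library model of that idea, not the package's implementation; the `worker` type and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"runtime"
)

// worker is a hypothetical stand-in for a Context: one goroutine, pinned
// to an OS thread, that executes queued functions strictly one at a time.
type worker struct {
	queue chan func()
}

// newWorker starts the goroutine. bufferSize bounds how many calls can be
// queued before run blocks.
func newWorker(bufferSize int) *worker {
	w := &worker{queue: make(chan func(), bufferSize)}
	go func() {
		runtime.LockOSThread()
		for f := range w.queue {
			f()
		}
	}()
	return w
}

// run queues f and returns a channel that receives f's result. The result
// channel has capacity 1, so the worker never blocks on a caller that
// ignores the result.
func (w *worker) run(f func() error) <-chan error {
	res := make(chan error, 1)
	w.queue <- func() { res <- f() }
	return res
}

func main() {
	w := newWorker(10)
	// Synchronous use: read the result immediately.
	err := <-w.run(func() error { return nil })
	fmt.Println("result:", err) // prints "result: <nil>"
	// Reading <-w.run(...) from inside another w.run callback would wait
	// on the only goroutine able to serve it, so it would never return.
}
```

In this model, a callback that waits on a nested call is waiting for work that can only start once the callback itself returns, which is the deadlock the text warns about.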
type DevAttr ¶
type DevAttr int
DevAttr is a CUDA device attribute.
const (
	DevAttrMaxThreadsPerBlock DevAttr = iota
	DevAttrMaxBlockDimX
	DevAttrMaxBlockDimY
	DevAttrMaxBlockDimZ
	DevAttrMaxGridDimX
	DevAttrMaxGridDimY
	DevAttrMaxGridDimZ
	DevAttrTotalConstantMemory
	DevAttrWarpSize
	DevAttrMaxPitch
	DevAttrMaxRegistersPerBlock
	DevAttrRegistersPerBlock
	DevAttrClockRate
	DevAttrTextureAlignment
	DevAttrGPUOverlap
	DevAttrMultiprocessorCount
	DevAttrKernelExecTimeout
	DevAttrIntegrated
	DevAttrCanMapHostMemory
	DevAttrComputeMode
	DevAttrMaximumTexture1DWidth
	DevAttrMaximumTexture2DWidth
	DevAttrMaximumTexture2DHeight
	DevAttrMaximumTexture3DWidth
	DevAttrMaximumTexture3DHeight
	DevAttrMaximumTexture3DDepth
	DevAttrMaximumTexture2DLayeredWidth
	DevAttrMaximumTexture2DLayeredHeight
	DevAttrMaximumTexture2DLayeredLayers
	DevAttrMaximumTexture2DArrayWidth
	DevAttrMaximumTexture2DArrayHeight
	DevAttrMaximumTexture2DArrayNumslices
	DevAttrSurfaceAlignment
	DevAttrConcurrentKernels
	DevAttrECCEnabled
	DevAttrPCIBusID
	DevAttrPCIDeviceID
	DevAttrTCCDriver
	DevAttrMemoryClockRate
	DevAttrGlobalMemoryBusWidth
	DevAttrL2CacheSize
	DevAttrMaxThreadsPerMultiprocessor
	DevAttrAsyncEngineCount
	DevAttrUnifiedAddressing
	DevAttrMaximumTexture1DLayeredWidth
	DevAttrMaximumTexture1DLayeredLayers
	DevAttrCanTex2DGather
	DevAttrMaximumTexture2DGatherWidth
	DevAttrMaximumTexture2DGatherHeight
	DevAttrMaximumTexture3DWidthAlternate
	DevAttrMaximumTexture3DHeightAlternate
	DevAttrMaximumTexture3DDepthAlternate
	DevAttrPCIDomainID
	DevAttrTexturePitchAlignment
	DevAttrMaximumTexturecubemapWidth
	DevAttrMaximumTexturecubemapLayeredWidth
	DevAttrMaximumTexturecubemapLayeredLayers
	DevAttrMaximumSurface1DWidth
	DevAttrMaximumSurface2DWidth
	DevAttrMaximumSurface2DHeight
	DevAttrMaximumSurface3DWidth
	DevAttrMaximumSurface3DHeight
	DevAttrMaximumSurface3DDepth
	DevAttrMaximumSurface1DLayeredWidth
	DevAttrMaximumSurface1DLayeredLayers
	DevAttrMaximumSurface2DLayeredWidth
	DevAttrMaximumSurface2DLayeredHeight
	DevAttrMaximumSurface2DLayeredLayers
	DevAttrMaximumSurfacecubemapWidth
	DevAttrMaximumSurfacecubemapLayeredWidth
	DevAttrMaximumSurfacecubemapLayeredLayers
	DevAttrMaximumTexture1DLinearWidth
	DevAttrMaximumTexture2DLinearWidth
	DevAttrMaximumTexture2DLinearHeight
	DevAttrMaximumTexture2DLinearPitch
	DevAttrMaximumTexture2DMipmappedWidth
	DevAttrMaximumTexture2DMipmappedHeight
	DevAttrComputeCapabilityMajor
	DevAttrComputeCapabilityMinor
	DevAttrMaximumTexture1DMipmappedWidth
	DevAttrStreamPrioritiesSupported
	DevAttrGlobalL1CacheSupported
	DevAttrLocalL1CacheSupported
	DevAttrMaxRegistersPerMultiprocessor
	DevAttrManagedMemory
	DevAttrMultiGPUBoard
	DevAttrMultiGPUBoardGroupID
	DevAttrHostNativeAtomicSupported
	DevAttrSingleToDoublePrecisionPerfRatio
	DevAttrPageableMemoryAccess
	DevAttrConcurrentManagedAccess
	DevAttrComputePreemptionSupported
	DevAttrCanUseHostPointerForRegisteredMem
)
All supported device attributes.
type Device ¶
type Device struct {
// contains filtered or unexported fields
}
Device contains a unique ID for a CUDA device.
func AllDevices ¶
AllDevices lists the available CUDA devices.
This needn't be called from a Context.
type Error ¶
type Error struct {
	// Context is typically a C function name.
	Context string

	// Name is the C constant name for the error,
	// such as "CURAND_STATUS_INTERNAL_ERROR".
	Name string

	// Message is the main error message.
	//
	// This may be human-readable, although it may often be
	// the same as Name.
	Message string
}
Error is a CUDA-related error.
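Because the struct exposes Name as a field, callers can branch on a specific status even when the error has been wrapped. The sketch below mirrors the documented fields in a local type; the pointer receiver, the output format of the Error method, and the helper `isStatus` are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Error mirrors the documented struct fields. The Error method, its
// pointer receiver, and its output format are assumptions for this sketch.
type Error struct {
	Context string
	Name    string
	Message string
}

func (e *Error) Error() string {
	return e.Context + ": " + e.Message
}

// isStatus reports whether err wraps an *Error with the given Name.
// It is a hypothetical helper.
func isStatus(err error, name string) bool {
	var cudaErr *Error
	return errors.As(err, &cudaErr) && cudaErr.Name == name
}

func main() {
	base := &Error{
		Context: "curandGenerate",
		Name:    "CURAND_STATUS_INTERNAL_ERROR",
		Message: "CURAND_STATUS_INTERNAL_ERROR",
	}
	wrapped := fmt.Errorf("sampling failed: %w", base)
	fmt.Println(isStatus(wrapped, "CURAND_STATUS_INTERNAL_ERROR")) // prints "true"
}
```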
type Module ¶
type Module struct {
// contains filtered or unexported fields
}
A Module manages a set of compiled kernels.
func NewModule ¶
NewModule creates a Module by compiling a chunk of PTX code.
This should be called from within the Context.
You can build PTX code using the nvcc compiler like so:
nvcc --gpu-architecture=compute_30 --gpu-code=compute_30 --ptx kernels.cu
In the above example, you build "kernels.cu" to a PTX file called "kernels.ptx".
The word size of the PTX should match the word size of the Go program. Depending on your use case, you may want to compile separate PTX files for 32-bit and 64-bit hosts.
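If separate PTX files are kept per word size, the host program needs to pick the matching one. A minimal sketch of that selection follows; the file naming scheme and the helper `ptxFileFor` are hypothetical, and only `bits.UintSize` comes from the standard library.

```go
package main

import (
	"fmt"
	"math/bits"
)

// ptxFileFor builds a file name for the given word size in bits.
// The "<base>_<bits>.ptx" naming scheme is an assumption for this sketch.
func ptxFileFor(base string, wordBits int) string {
	return fmt.Sprintf("%s_%d.ptx", base, wordBits)
}

func main() {
	// bits.UintSize is 32 or 64 depending on the host platform.
	fmt.Println(ptxFileFor("kernels", bits.UintSize))
}
```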
func (*Module) Launch ¶
func (m *Module) Launch(kernel string, gridX, gridY, gridZ, blockX, blockY, blockZ, sharedMem uint, stream *Stream, args ...interface{}) error
Launch launches a kernel (which is referenced by name).
This should be called from within the same Context that NewModule was called from.
Currently, the following types may be used as kernel arguments:
uint
int
float32
float64
unsafe.Pointer
Buffer
To wait for the launched kernel to complete, use Synchronize(), or stream.Synchronize() if you specified a non-nil stream.
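Choosing gridX for a one-dimensional launch usually comes down to a ceiling division of the element count by the block size. The sketch below shows that arithmetic only; `blocksFor` is a hypothetical helper, and it assumes blockSize is greater than zero.

```go
package main

import "fmt"

// blocksFor returns the number of blocks needed to cover n elements with
// blockSize threads per block, rounding up. blockSize must be non-zero.
// It is a hypothetical helper, not part of the cuda package.
func blocksFor(n, blockSize uint) uint {
	return (n + blockSize - 1) / blockSize
}

func main() {
	const n, blockX = 1000, 256
	gridX := blocksFor(n, blockX)
	fmt.Println(gridX, gridX*blockX) // prints "4 1024"
}
```

Since 1024 threads are launched for 1000 elements in this example, the kernel itself would need a bounds check so the extra 24 threads do nothing.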
type Stream ¶
type Stream struct {
// contains filtered or unexported fields
}
A Stream manages a pipeline of CUDA operations. Streams can be employed to achieve parallelism.
func NewStream ¶
NewStream creates a new Stream.
If nonBlocking is true, then this stream will be able to run concurrently with the default stream.
This should be called in a Context.
func NewStreamPriority ¶
NewStreamPriority is like NewStream, but the resulting stream is assigned a certain priority.
This should be called in a Context.
func (*Stream) Close ¶
Close destroys the stream.
This will return immediately, even if the stream is still doing work.
A stream should not be used after it is closed.
This should be called in a Context.
func (*Stream) Pointer ¶
Pointer returns the raw pointer value of the underlying stream object.
If s is nil, then a NULL pointer is returned.
This should be called in a Context.
func (*Stream) Synchronize ¶
Synchronize waits for the stream's tasks to complete.