OpenCL Essential Course

Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators.

OpenCL™ is a young technology, and, while a specification has been published, there are currently few documents that provide a basic introduction with examples. This article helps make OpenCL™ easier to understand and implement.

Note that:

I work at AMD, and, as such, I will test all example code on our implementation for both Windows® and Linux®; however, my intention is to illustrate the use of OpenCL™ regardless of platform. All examples are written in pure OpenCL™ and should run equally well on any implementation.
I have done my best to provide examples that work out-of-the-box on non-AMD implementations of OpenCL™, but I will not be testing them on non-AMD implementations; therefore, it is possible that an example might not work as expected on such systems. If this is the case, please let me know via our OpenCL™ forum, and I will do my best to rectify the code and publish an update.

The following “Hello World” tutorial provides a simple introduction to OpenCL™. I hope to follow up this first tutorial with additional ones covering topics such as:

Using platform and device layers to build robust OpenCL™ applications
Program compilation and kernel objects
Managing buffers
Kernel execution
Kernel programming – basics
Kernel programming – synchronization
Matrix multiply – a case study
Kernel programming – built-ins

Writing and running your first app with code executing on the CPU and the GPU
OpenCL provides many benefits in the field of high-performance computing, and one of the most important is portability. OpenCL-coded routines, called kernels, can execute on GPUs and CPUs from such popular manufacturers as Intel, AMD, NVIDIA, and IBM. New OpenCL-capable devices appear regularly, and efforts are underway to port OpenCL to embedded devices, digital signal processors, and field-programmable gate arrays.

Not only can OpenCL kernels run on different types of devices, but a single application can dispatch kernels to multiple devices at once. For example, if your computer contains an AMD Fusion processor and an AMD graphics card, you can synchronize kernels running on both devices and share data between them. OpenCL kernels can even be used to accelerate OpenGL or Direct3D processing.

Despite these advantages, OpenCL has one significant drawback: it's not easy to learn. OpenCL isn't derived from MPI or PVM or any other distributed computing framework. Its overall operation resembles that of NVIDIA's CUDA, but OpenCL's data structures and functions are unique. Even the most introductory application is difficult for a newcomer to grasp. You really can't just dip your foot in the pool: you either know OpenCL or you don't.

My goal in writing this article is to explain the concepts behind OpenCL as simply as I can and show how these concepts are implemented in code. I'll explain how host applications work and then show how kernels execute on a device. Finally, I'll walk through an example application with a kernel that adds 64 floating-point values together.
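
Before moving to the device, it can help to see what "adding 64 floating-point values together" looks like as ordinary serial C. The sketch below is only a host-side reference for checking a device result later; the function names and the choice of input values (0, 1, ..., 63) are assumptions made for illustration and are not taken from the example application.

```c
#include <stddef.h>

#define ARRAY_SIZE 64  /* the example application adds 64 values */

/* Fills data[i] with i. Illustrative input only (an assumption). */
void fill_ramp(float *data, size_t n) {
    for (size_t i = 0; i < n; i++) {
        data[i] = (float)i;
    }
}

/* Serial reference: adds the n values one after another on the host. */
float sum_serial(const float *data, size_t n) {
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}
```

With this input the expected total is 2016 (63 × 64 / 2), which gives a known value to compare against whatever the kernel returns.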

Host Application Development
In developing an OpenCL project, the first step is to code the host application. This runs on a user's computer (the host) and dispatches kernels to connected devices. The host application can be coded in C or C++, and every host application requires five data structures: cl_device_id, cl_kernel, cl_program, cl_command_queue, and cl_context.

When I started learning OpenCL, I found it hard to remember these structures and how they work together, so I devised an analogy: An OpenCL host application is like a game of cards.