Discovering OpenACC 2.0:
the new data management features
Stéphane Chauveau, PhD | Dec. 27, 2013
After a first release that proved essential to move parallel computations in C/C++ and Fortran to accelerators in a standardized way, OpenACC 2.0 was long overdue. Now that it’s here, this series of three articles aims to help you make the most of its new capabilities and expanded functionality. This month, we’ll start with data management, an area that really needed improvement...
The first OpenACC API specification (v1.0) was initially released in 2011 by PGI, Cray and NVIDIA with support from CAPS. At the time, it was introduced as a temporary test ground for the future accelerator extensions of OpenMP. Now, after almost two years of improvements by several new members, version 2.0 aims to become the de facto standard for directive-based accelerator programming in C/C++ and Fortran.
One of the recurrent complaints about OpenACC 1.0 was the lack of flexibility in data management. Offloading data to the accelerator could be achieved by creating a so-called data region that had to be perfectly nested in the code. The data construct provides the most obvious way to create a data region but implicit ones also exist around compute constructs (parallel and kernels). The standalone declare directive can also be used to define a data region corresponding to its current scope, which can be either local to the current procedure or global to the application. In any case, the rules to offload variables basically follow the scoping rules of variables in the source language.
This choice makes sense from the compiler standpoint since by construction the lifetime of a data region is entirely contained in the lifetime of the host variable it refers to. Such a strategy can however become problematic in applications using more complex data structures. OpenACC 1.0 does not provide any mechanism to manage a dynamic number of data offloads except by using tricks such as creating a recursive function to encapsulate a data region as many times as needed. But that is clearly not an elegant solution. It is also impossible to write code in which offloaded data has to be allocated and deallocated in different functions. This is typically the case in object-oriented applications where object constructors and destructors are the obvious places to manage the lifetime of offloaded data.
This last case is illustrated in listing 1 where a function has to process in pairs an arbitrary number n of data structures provided by a vector of pointers. Since OpenACC 1.0 does not provide any mechanism to offload the n data structures simultaneously, the code keeps only two structures on the device at any time. This is of course quite inefficient since each of the n data structures has to be offloaded an average of n/2 times. Performance could probably be improved by using the recursive trick mentioned above or by managing more than one structure per data region but that would require some aggressive changes to the original code, which is not the intended purpose of a directive-based API.
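As a rough sketch of this pattern (not the article's own listing), assume a hypothetical block_t structure holding a fixed-size array and a pairwise update that adds one block into another. The names block_t, BLOCK_SIZE and process_pairs_v1 are illustrative assumptions. With OpenACC 1.0, each pair needs its own structured data region, so both blocks are transferred again for every pair:

```c
#include <stddef.h>

/* Hypothetical data structure: a fixed-size block of floats. */
#define BLOCK_SIZE 1024
typedef struct { float v[BLOCK_SIZE]; } block_t;

/* OpenACC 1.0 style: one structured data region per pair of blocks. */
void process_pairs_v1(block_t **blocks, int n)
{
    for (int i = 0; i < n; ++i) {
        for (int j = i + 1; j < n; ++j) {
            float *a = blocks[i]->v;
            float *b = blocks[j]->v;

            /* Both blocks are sent to the device here, and 'a' is
               copied back at the closing brace, for every single pair. */
            #pragma acc data copy(a[0:BLOCK_SIZE]) copyin(b[0:BLOCK_SIZE])
            {
                #pragma acc parallel loop present(a[0:BLOCK_SIZE], b[0:BLOCK_SIZE])
                for (int k = 0; k < BLOCK_SIZE; ++k)
                    a[k] += b[k];
            }
        }
    }
}
```

Without an OpenACC compiler the pragmas are ignored and the function simply runs on the host, which is convenient for checking the logic before worrying about transfers.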
Data management becomes dynamic
OpenACC 2.0 solves most of these problems by introducing new dynamic data management features. Two new standalone directives, enter data and exit data, can be used to respectively create and destroy offloaded data. Together, they are basically equivalent to a data construct but without the proper nesting requirements. Accordingly, they can be arbitrarily placed anywhere in the code and, unlike other data constructs, they can even be executed asynchronously.
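To make the idea concrete, here is a minimal sketch of an object whose device copy is created in one function and destroyed in another, something a nested data region cannot express. The field_t type and the field_* function names are hypothetical, and the use of async queue 1 is just one possible choice:

```c
#include <stdlib.h>

/* Hypothetical object owning a host array and its device copy. */
typedef struct { float *data; int len; } field_t;

/* "Constructor": allocates on the host and starts the offload. */
field_t *field_create(int len, float init)
{
    field_t *f = malloc(sizeof *f);
    if (!f) return NULL;
    f->data = malloc((size_t)len * sizeof *f->data);
    if (!f->data) { free(f); return NULL; }
    f->len = len;
    for (int k = 0; k < len; ++k) f->data[k] = init;

    float *d = f->data;
    /* Unstructured: the device copy outlives this function.
       The transfer is queued on async queue 1. */
    #pragma acc enter data copyin(d[0:len]) async(1)
    return f;
}

/* Compute on the device copy, queued behind the transfer. */
void field_scale(field_t *f, float factor)
{
    float *d = f->data;
    int len = f->len;
    #pragma acc parallel loop present(d[0:len]) async(1)
    for (int k = 0; k < len; ++k)
        d[k] *= factor;
}

/* Bring the results back to the host and wait for completion. */
void field_sync(field_t *f)
{
    float *d = f->data;
    int len = f->len;
    #pragma acc update host(d[0:len]) async(1)
    #pragma acc wait(1)
}

/* "Destructor": releases the device copy, then the host memory. */
void field_destroy(field_t *f)
{
    float *d = f->data;
    int len = f->len;
    #pragma acc wait(1)
    #pragma acc exit data delete(d[0:len])
    free(f->data);
    free(f);
}
```

A typical call sequence would be field_create(), any number of field_scale() calls, field_sync() when the host needs the values, and finally field_destroy().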
Listing 2 illustrates how the code in listing 1 can be optimized in OpenACC 2.0. Each of the n data structures is now offloaded to the accelerator only once (one copyin and one copyout). The enter data and exit data directives are also provided as API calls (acc_copyin, acc_copyout, acc_delete and so on).
It should be noted that since enter data and exit data do not provide a proper data region from the language standpoint, a present clause is still needed on the parallel construct to make the data accessible within the parallel region (this clause will not cause any kind of allocation or data transfer). The enter data and exit data directives and the associated API calls are therefore especially suitable for writing libraries in which data is managed and used by multiple functions.
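The approach described above can be sketched as follows. This is an illustration under assumptions rather than the article's own listing: block_t, BLOCK_SIZE and process_pairs_v2 are hypothetical names, and the pairwise update is a simple element-wise addition. Every block is offloaded once before the pair loop and copied back once after it, and the compute construct only declares the data as present:

```c
#include <stddef.h>

/* Hypothetical data structure: a fixed-size block of floats. */
#define BLOCK_SIZE 1024
typedef struct { float v[BLOCK_SIZE]; } block_t;

/* OpenACC 2.0 style: each block is transferred once in each direction. */
void process_pairs_v2(block_t **blocks, int n)
{
    /* Offload all n blocks; the number of offloads is only known at run time. */
    for (int i = 0; i < n; ++i) {
        float *p = blocks[i]->v;
        #pragma acc enter data copyin(p[0:BLOCK_SIZE])
    }

    for (int i = 0; i < n; ++i) {
        for (int j = i + 1; j < n; ++j) {
            float *a = blocks[i]->v;
            float *b = blocks[j]->v;

            /* present: no allocation and no transfer, only a lookup. */
            #pragma acc parallel loop present(a[0:BLOCK_SIZE], b[0:BLOCK_SIZE])
            for (int k = 0; k < BLOCK_SIZE; ++k)
                a[k] += b[k];
        }
    }

    /* Copy every block back and release its device copy. */
    for (int i = 0; i < n; ++i) {
        float *p = blocks[i]->v;
        #pragma acc exit data copyout(p[0:BLOCK_SIZE])
    }
}
```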
Most of the data management features found in directives and constructs are now also provided as API calls. For instance, acc_copyin(), acc_present_or_copyin(), acc_create() and acc_present_or_create() correspond to the data clauses allowed on the enter data directive while acc_copyout() and acc_delete() correspond to the data clauses allowed on the exit data directive. These functions are unfortunately all synchronous but asynchronous alternatives will very likely make it to the next major OpenACC release, at least for a few of them.
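As an illustration of these calls, the sketch below offloads two arrays with acc_copyin(), runs a kernel on them, then releases one with acc_delete() and retrieves the other with acc_copyout(). The function name saxpy_offloaded is a hypothetical example, and the _OPENACC guard is an assumption made so that the same file still builds and runs on the host with a compiler that has no OpenACC support:

```c
#include <stddef.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Computes y = a*x + y, managing the device copies through API calls. */
void saxpy_offloaded(float a, float *x, float *y, int n)
{
#ifdef _OPENACC
    size_t bytes = (size_t)n * sizeof(float);
    acc_copyin(x, bytes);   /* same effect as: enter data copyin(x[0:n]) */
    acc_copyin(y, bytes);   /* same effect as: enter data copyin(y[0:n]) */
#endif

    /* The data was created outside any data region, hence present. */
    #pragma acc parallel loop present(x[0:n], y[0:n])
    for (int k = 0; k < n; ++k)
        y[k] = a * x[k] + y[k];

#ifdef _OPENACC
    acc_delete(x, bytes);   /* same effect as: exit data delete(x[0:n]) */
    acc_copyout(y, bytes);  /* same effect as: exit data copyout(y[0:n]) */
#endif
}
```

Note that the API calls take a size in bytes, whereas the directive clauses take an array section expressed in elements.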
Developers looking to implement their own data management features will also benefit from some low level API calls: acc_malloc() and acc_free() provide direct memory allocation on the current device. acc_map_data() and acc_unmap_data() associate arbitrary addresses on the host and on the device. acc_deviceptr() resolves the device address associated with a host address while acc_hostptr() performs the reverse resolution. Last, acc_memcpy_to_device() and acc_memcpy_from_device() transfer arbitrary memory regions between host and device.
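A small sketch of the lowest-level calls: a buffer is allocated directly on the device, filled from the host, read back into a second host buffer and freed. The function name device_round_trip is hypothetical, and the host-only stand-ins in the #else branch are an assumption added purely so that the example builds without an OpenACC compiler; they are not part of the OpenACC API.

```c
#include <stdlib.h>
#include <string.h>
#ifdef _OPENACC
#include <openacc.h>
#else
/* Host-only stand-ins (assumption): let the sketch build without OpenACC. */
static void *acc_malloc(size_t n) { return malloc(n); }
static void acc_free(void *p) { free(p); }
static void acc_memcpy_to_device(void *dst, void *src, size_t n) { memcpy(dst, src, n); }
static void acc_memcpy_from_device(void *dst, void *src, size_t n) { memcpy(dst, src, n); }
#endif

/* Copies 'count' floats from src to dst through a device buffer.
   Returns 0 on success, -1 if the device allocation failed. */
int device_round_trip(float *src, float *dst, size_t count)
{
    size_t bytes = count * sizeof *src;

    void *dev = acc_malloc(bytes);            /* raw device memory */
    if (!dev) return -1;

    acc_memcpy_to_device(dev, src, bytes);    /* host -> device */
    acc_memcpy_from_device(dst, dev, bytes);  /* device -> host */

    acc_free(dev);
    return 0;
}
```

Memory obtained this way is not associated with any host address until acc_map_data() is called, so it is invisible to present clauses and to acc_deviceptr() in the meantime.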
Standalone data directives (and their equivalent API calls) can be mixed with data constructs but offloaded data created using one kind of directive cannot be destroyed using the other. Standalone data directives are not aware of each other so it would be incorrect, for instance, to re-implement the semantics of a present_or_copyin clause by a call to acc_present_or_copyin() followed by a call to acc_delete(). The proper implementation, as illustrated in listing 3, requires the execution of both calls only if the data was not already present.
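One possible way to write this, sketched here under assumptions and not taken from the article's listing, is to record whether the data is already on the device with acc_is_present() and to perform the copy and the matching delete only when it is not. The function name sum_array is hypothetical, and the _OPENACC guard is there so that the code also runs on the host without OpenACC support:

```c
#include <stddef.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Sums an array on the device with present_or_copyin-like behavior:
   data offloaded by the caller is left untouched on exit. */
double sum_array(double *p, int n)
{
#ifdef _OPENACC
    size_t bytes = (size_t)n * sizeof *p;
    int was_present = acc_is_present(p, bytes);
    if (!was_present)
        acc_copyin(p, bytes);      /* we created it ... */
#endif

    double s = 0.0;
    #pragma acc parallel loop present(p[0:n]) reduction(+:s)
    for (int k = 0; k < n; ++k)
        s += p[k];

#ifdef _OPENACC
    if (!was_present)
        acc_delete(p, bytes);      /* ... so we are the ones to destroy it */
#endif
    return s;
}
```

An unconditional acc_delete() at the end would also destroy a device copy that the caller had created earlier and may still rely on.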