An adaptive offline implementation selector for heterogeneous parallel platforms

Abstract

Heterogeneous parallel platforms, comprising multiple processing units and architectures, have become a cornerstone in improving the overall performance and energy efficiency of scientific and engineering applications. Nevertheless, taking full advantage of their resources comes with a variety of difficulties: developers need technical expertise in different parallel programming frameworks and prior knowledge of the algorithms used underneath by the application. To alleviate this burden, we present an adaptive offline implementation selector that allows users to better exploit the resources provided by heterogeneous platforms. Specifically, this framework selects, at compile time, the device-implementation tuple that delivers the best performance on a given platform. The user interface of the framework leverages two C++ language features: attributes and concepts. To evaluate the benefits of this framework, we analyse the global performance and convergence of the selector using two different use cases. The experimental results demonstrate that the proposed framework allows users to enhance performance while minimizing the effort needed to tune applications targeted to heterogeneous platforms. Furthermore, we also demonstrate that our framework delivers performance comparable to that of other approaches.

Keywords

Implementation selector, heterogeneous platforms, auto-tuning, C++ attributes, C++ concepts

1. Introduction

In recent years, heterogeneous parallel architectures have provided a way to improve performance and energy efficiency better than other alternatives. However, platforms comprising diverse devices (such as multi-cores, graphics processing units (GPUs), digital signal processors (DSPs) and field-programmable gate arrays (FPGAs)) are notoriously more difficult to program effectively, since they demand distinct frameworks and application programming interfaces.¹ This fact has led to multiple implementations of the same algorithm, each targeted to a different device. Therefore, an additional issue arises when programming for heterogeneous platforms: selecting the most suitable device and routine implementation to solve a given problem. Usually, in order to improve performance, developers need to analyse a priori the target platform and the application, along with its implementation alternatives and available libraries. To achieve this goal, several aspects need to be considered. For instance, some libraries exhibit better behaviour than others for a given problem size.² Also, devices can have different features (such as the number of cores, processor frequency or memory size), which may influence, or even restrict, the use of a specific library routine.

One approach to cope with this problem is to manually select the algorithm implementation and map the execution onto the underlying parallel device based on past knowledge. Nevertheless, this procedure becomes complex when dealing with multiple devices and libraries. A common technique is to define a set of constraints in order to guide a runtime scheduler to select the most suitable implementation. This technique, however, has non-negligible performance overheads, since it is necessary to re-evaluate the selection each time a routine is called. An alternative to the aforementioned technique is to shift the decision-making task to compile time. Several proposals leveraging this static approach and based on analytic models, machine learning and adaptive optimization methods can be found in the literature.³ However, these approaches also incur non-negligible profiling costs at runtime and/or modelling overheads at compile time. Therefore, there is a trade-off between runtime overheads and profiling costs in both approaches that needs to be considered depending on the target application. Given the foregoing, in this article, we enrich the current state of the art in static approaches with the following contributions:

We present an offline implementation selector framework that leverages a profile-guided approach and is able to decide, across compilations, which device-implementation tuple delivers the best performance based on historical information.

We enable portability of the approach by using two novel features of the standard C++ language, concepts and attributes, as the end-user interfaces.

We evaluate the performance of the selector by analysing its convergence time and its benefits in minimizing execution time using the general matrix–matrix multiplication (GEMM) kernel and an image processing application.

We demonstrate that the framework self-tunes to platform changes and delivers comparable performance figures with respect to a runtime approach from the state-of-the-art.

The remainder of the article is structured as follows. Section 2 reviews existing work on implementation selectors and decision-making models. Section 3 describes the C++ language features used, concepts and attributes, along with the hardware parallel platform description language (HPP-DL). Section 4 introduces our adaptive offline implementation selector framework and its decision algorithm. Section 5 evaluates the convergence time and performance benefits of the adaptive selector and compares it with a runtime approach. Finally, Section 6 closes the article with some concluding remarks and future work.

2. Related work

Since heterogeneous platforms have spread across the scientific community, different implementations of the same algorithm have been developed for specific devices. For example, several numerical libraries comprising highly tuned kernels, from BLAS and LAPACK, are available for several computing architectures, for example, cuBLAS⁴ for nVidia GPUs, GSL⁵ for multi-/many-core processors and so on. This situation reveals a new challenge: selecting the most suitable device and routine implementation to solve a given problem. To tackle this issue, two different approaches have traditionally been taken: (i) runtime schedulers, able to map and execute kernels from multiple libraries on the accelerators available in a heterogeneous platform and (ii) static tools that select, at compile time, the most appropriate implementation according to past knowledge. Obviously, dynamic approaches are more flexible in terms of mapping tasks to devices according to, for example, load parameters, while static approaches have to use concrete implementations during the entire application run.

Some research works using static approaches can be found in the literature. For instance, the work by Wang et al.⁶ presents a mapping model based on machine learning techniques, which is tied to the training platform. Other works, such as that presented by Tan et al.,⁷ propose an automatic system based on source code analysis that maps user calls to optimized kernels. Similarly, Shen et al.⁸ propose an analytic system for determining which hybrid programming configuration is optimal for a given problem.

On the other hand, dynamic approaches are also widespread in the community. In particular, the OmpSs⁹ programming framework leverages an extended set of OpenMP-like pragmas to support asynchronous parallelism and exploit the task-parallelism of applications via data dependencies. Concretely, among the pragma options, the target directive allows developers to select the target device in a heterogeneous platform. Together with this directive, the implements clause lets users specify that the annotated code is an alternate implementation of a given function for a specific device. This feature allows its versioning runtime scheduler to freely map the same task onto different devices. Other works in this line, such as the extension for the SkePU framework presented in the study by Dastgeer et al.,¹⁰ take advantage of a machine learning mechanism in order to automatically select the most appropriate implementation of a given function. These models basically carry out a tuning phase to estimate the ranges in which different implementations perform better than others and to take the most appropriate one. Following a similar approach, the tool presented in this article leverages C++ attributes and concepts to select between implementations and devices from the heterogeneous platform. However, it shifts the selection process to compile time. Obviously, this approach restricts flexibility but offers benefits in terms of memory usage for embedded systems.

3. Background

This section gives a brief overview of the two C++ language features used for developing the implementation selector interface: C++ attributes and concepts. Furthermore, we describe the HPP-DL leveraged by the selector to keep track of available resources.

3.1. The C++11 attributes

Attributes are a new feature of the C++11 language for attaching properties to program entities. The power of attributes is that they provide extra information to the compiler so that it can perform a given action on the annotated entity. One of the main advantages of attributes with respect to pragmas is their flexibility, as they can attach properties to any entity, for example, variables, functions or types, and do not need to appear on a separate line.¹¹ The basic syntax for attributes in C++11 is: [[namespace::attribute]].

However, unlike attributes from other languages, C++11 attributes are compiler-defined, that is, the user cannot define their own attributes at runtime. Indeed, the C++ runtime type identification mechanism does not keep any attribute information, so the knowledge given in the attributes is not accessible from the user application. Instead, attribute extensions require modifications in the C++ compiler. Therefore, the purpose of this feature is to allow future C++ extensions without extending the set of keywords or the grammar.

3.2. The C++ concepts

Concepts are a novel extension of the C++ programming language that allows defining and evaluating, at compile time, constraints set on template arguments.¹² In particular, concepts deliver better support for error checking in generic programming contexts; thus, they can be seen as a mechanism to prevent and diagnose improper uses of templates. In general, concepts allow overloading templates and disabling those whose types do not satisfy the predefined constraints. Lexically, concepts can be expressed with the keywords concept and requires. Their current implementation, namely concepts lite, can be found in the GCC C++ experimental branch.¹³

To enable portability of our framework, we use both attributes and concepts as the user interfaces. However, given that concepts are a novel feature of the language, they are not yet supported by all current C++ compilers. In contrast, attributes are a feature of the C++ language that is already supported by any C++11-compliant compiler. In a nutshell, attributes are easier to use than concepts, but at the expense of modifying the compiler internals or using an external parsing tool to process them.

3.3. The HPP-DL

The HPP-DL is an open specification that leverages hierarchical models for describing features of heterogeneous parallel platforms.¹⁴ This language is intended to be used for making platform-specific information available to developers and tools, such as auto-tuners, compilers, schedulers or runtime systems.

This specification comprises information about the following classes: (i) components describe the hardware resources of the platform, involving processors (CPU sockets), cores, main memory, GPUs or OpenCL-based devices; (ii) links represent one-way relationships between two components and incorporate information about data transmission, namely throughput and latency and (iii) resources portray interfaces for accessing attached computing devices, for example, FPGAs or DSPs. These resources comprise I/O ports, IRQs or address ranges.

4. The adaptive offline implementation selector

In this section, we describe the adaptive offline implementation selector framework, the main contribution of this article. Figure 1 shows the general workflow of this framework. As can be seen, in a first step, the system administrator provides users with all the possible implementations of a function by producing, using the header generation tool, all necessary function wrappers for the two proposed interfaces, that is, attributes and concepts. This command-line tool takes, for each function to be replaced, the following arguments: (i) the function name, (ii) all the available implementation names (e.g. GSL or clBLAS) and (iii) the supported devices (e.g. CPU or GPU). With this information, the command-line tool generates wrappers whose names are a concatenation of the three aforementioned arguments. Afterwards, the function wrappers in the generated header files should be completed by the administrator with the corresponding code for calling the specific implementations. Note that all generated headers are automatically instrumented for measuring execution time. This is done in order to support the profile-guided approach implemented by the selector. Additionally, the generated headers incorporate the code required to support the framework interfaces. In parallel, the administrator should also obtain the required hardware information, using the automatic tool provided, and store it in an HPP.json file, according to the HPP-DL specification.

Figure 1.

Diagram of the adaptive offline implementation selector.

On the other hand, users should call the interfaces, generated by the administrator and present in the header files, and provide static information (e.g. problem size) to the selector in the form of attributes or template arguments. Next, the attribute-based and concepts-based implementation selectors each call the selector module so as to determine the most suitable implementation based on historical data and the static information provided by the user. Finally, once the application has been compiled and executed, the performance file (PERF.h) is updated automatically with the new performance data (e.g. ‘problem size–execution time’ tuples) in order to improve the selector knowledge for future decisions.
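As a rough illustration of the profile-guided bookkeeping described above, the following sketch times one kernel invocation and records a 'problem size–execution time' tuple. It is a minimal sketch under stated assumptions: the names PerfSample, PerfHistory and timed_call are hypothetical, the history is kept in memory, and the actual layout of PERF.h and of the generated instrumentation is not specified in this text.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical record: one 'problem size - execution time' tuple.
struct PerfSample {
    std::size_t size;  // problem size supplied by the caller
    double seconds;    // measured wall-clock execution time
};

// Hypothetical in-memory history for one device-implementation pair.
struct PerfHistory {
    std::vector<PerfSample> samples;
};

// Runs `kernel`, measures its wall-clock time with a monotonic clock and
// appends the resulting sample to `history`.
template <typename Kernel>
void timed_call(PerfHistory& history, std::size_t problem_size, Kernel&& kernel) {
    assert(problem_size > 0);
    const auto start = std::chrono::steady_clock::now();
    std::forward<Kernel>(kernel)();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = stop - start;
    history.samples.push_back(PerfSample{problem_size, elapsed.count()});
}
```

A generated wrapper could, for instance, pass a lambda that calls the concrete library routine, so that every execution contributes one more sample to the history consulted at the next compilation.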

In the following sections, we describe how the concepts and the attribute mechanisms should be used within the framework. Next, we enumerate the steps taken by the selector module.

4.1. The attributes-based interface

The interface of the framework based on C++ attributes is intended to define constraints in order to guide the compiler to select a concrete implementation among the available ones. These attributes are part of our previous work presented in the study by Sanchez et al.¹⁵

4.1.1. Administrator actions

First, the administrator needs to run the header generation tool, with the attributes flag option set, so as to generate the headers containing the annotated function wrappers for a given algorithm. Listing 1 shows an example of generated code for the dgemm kernel. Next, the administrator should complete the attribute-annotated wrapper bodies with the corresponding function calls for the interfaces generated. The attributes inserted, under the ais (adaptive implementation selector) namespace, are as follows:

ais::implements: This attribute specifies that the code under the attribute is an alternate implementation of a given interface. Basically, it receives, as a single parameter, the function name to let the selector know which implementations are available for that interface.

ais::device: This attribute binds a given implementation to a specific target device. Examples of valid parameters for this attribute are CPU, GPU, PHI (for the Intel Xeon Phi) and so on.

Listing 1.

Generated header file for attribute-annotated function implementations.

4.1.2. User actions

On the other hand, the users are responsible for annotating candidate function calls in the application code in order to be analysed by the attribute-based implementation selector tool of the framework. These C++ attributes are as follows:

ais::interface: This attribute indicates that the annotated function call is an interface and should be treated by the tool and replaced during the selection process.

ais::target: It defines the preferred target device to execute an annotated interface, for example, ais::target(CPU). Valid arguments for this attribute are those accepted by the ais::device attribute.

ais::size: This attribute is used when the user knows a priori the problem size during the function call. As a single parameter, it receives an integer indicating the problem size.

ais::min and ais::max: Alternatively, when the user does not know the specific problem size, these attributes can be used to let the selector know the lower and upper bounds of the problem size with which a certain function will be executed.

For instance, Listing 2 contains an example of user code with different C++ attribute-annotated function calls that match the interface dgemm defined by the administrator in the header file. As can be observed, both first and second calls to dgemm have been annotated using the attributes for a fixed size and a range of sizes, respectively. After the framework preprocesses the annotated source code, the interfaces are automatically replaced by the most suitable function implementations. In the example of Listing 3, both dgemm calls have been replaced by function calls to the clBLAS and GSL kernel implementations, respectively.

Listing 2.

User-annotated function interfaces.

Listing 3.

Attributes replaced by implementations.
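The listings referred to in this section are not reproduced in this text. As a hedged sketch of how the annotations described above might look, the following code marks a wrapper with ais::implements and ais::device and annotates a call with ais::interface, ais::target and ais::size. The signatures, the naive kernel, the Device enumeration and the function square2x2 are assumptions made for illustration and are not the generated code. Without the preprocessing tool, a standard compiler simply ignores the unknown ais attributes (typically with a warning), so the sketch forwards the interface to the single implementation by hand.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical device tags, declared so that the attribute arguments are
// ordinary identifiers known to the compiler.
enum Device { CPU, GPU };

// Interface called by user code (hypothetical signature): C = A * B for
// square n x n matrices stored in row-major order.
void dgemm(std::size_t n, const double* a, const double* b, double* c);

// Administrator side: a wrapper annotated as one implementation of dgemm
// bound to the CPU. A naive kernel stands in for a real library call.
[[ais::implements(dgemm), ais::device(CPU)]]
void dgemm_naive_cpu(std::size_t n, const double* a, const double* b, double* c) {
    assert(a != nullptr && b != nullptr && c != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

// Manual fallback used when no selector tool rewrites the annotated call.
void dgemm(std::size_t n, const double* a, const double* b, double* c) {
    dgemm_naive_cpu(n, a, b, c);
}

// User side: the call is annotated with a preferred device and a fixed,
// statically known problem size.
std::array<double, 4> square2x2(const std::array<double, 4>& m) {
    std::array<double, 4> out{};
    [[ais::interface, ais::target(CPU), ais::size(2)]]
    dgemm(2, m.data(), m.data(), out.data());
    return out;
}
```

In the framework itself, the preprocessing step would replace the annotated dgemm call with a direct call to the selected wrapper, so the forwarding function above would not be needed.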

4.2. The concepts-based interface

The interface of the framework based on concepts is an alternative to the aforementioned mechanism based on attributes that pursues the same goal.

4.2.1. Administrator actions

In this case, the system administrator uses the header generation tool with the concepts flag option set to introduce an algorithm implementation and the devices supported. Listing 4 shows an example of generated code for the dgemm kernel. This tool generates a header file containing the function templates with the concepts in charge of determining the most suitable implementation for a given input size or range of sizes. Next, the function templates in the header file are to be completed manually with the corresponding function calls to specific implementations for such an algorithm.

Listing 4.

Generated header file for concepts function implementations.

4.2.2. User actions

The users, in turn, call the supported function interfaces providing the necessary information, in the form of template arguments, so as to guide the compiler through the framework to link against concrete implementations. As an example, the first call to function dgemm in Listing 5 provides a fixed problem size for the matrices, while the second provides a range of sizes. All in all, this static information allows the compiler to select the dgemm implementation thanks to the concepts introduced by the header generation tool and the requirements specified by the user.

Listing 5.

Concepts mechanism used in the dgemm function interface.
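The listing itself is not reproduced in this text. To give a rough idea of how a compile-time constraint on a template argument can steer the choice of implementation, the sketch below selects between two overloads according to a fixed problem size. It is a stand-in rather than the concepts mechanism itself: the constraint is written with std::enable_if so that it also builds with compilers lacking concepts support, whereas a concepts-based header would express the same predicate in a requires clause. The crossover value, the function name dgemm_choice and the returned labels are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical crossover size; in the framework the decision would be
// derived from the historical performance data rather than hard-coded.
constexpr std::size_t kCrossover = 512;

// Overload enabled only for problem sizes below the crossover.
template <std::size_t N>
typename std::enable_if<(N < kCrossover), std::string>::type
dgemm_choice() {
    return "GSL on CPU";
}

// Overload enabled only for problem sizes at or above the crossover.
template <std::size_t N>
typename std::enable_if<(N >= kCrossover), std::string>::type
dgemm_choice() {
    return "clBLAS on GPU";
}
```

Under these assumptions, dgemm_choice<256>() resolves to the first overload and dgemm_choice<2048>() to the second, and the decision is made entirely during overload resolution, with no runtime branch.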

4.3. The selector module

This section describes the internal algorithm of the adaptive implementation selector module. This algorithm has been implemented for accepting the two aforementioned interfaces: attributes and concepts. Note that the attributes-based implementation selector has been developed as a preprocessing tool using the Clang 3.8.0 compiler Application Program Interface (API), while the concepts-based selector is implicitly embedded into the semantic concepts code and interpreted by the compiler itself.

The selection algorithm implemented by our framework is entirely based on the problem size and boundaries specified by the user. Depending on this information, the algorithm proceeds as follows:

If the template argument or attribute for a fixed size is set, the selector takes the implementation offering the minimum execution time. This is done using the information stored in PERF.h. To do so, the selector performs a linear interpolation at the requested problem size for every implementation available for that function interface. If multiple implementations deliver the same minimum execution time, the selector randomly picks one of them. Consider the scenario in Figure 2(a), showing the behaviour of the dgemm interface offering five implementations. For instance, if the user sets the size parameter to 1024, the selector will consider the clBLAS version running on CPU, while if the problem size is fixed to 448, the selector will randomly select GSL or clBLAS (Xeon). To avoid extreme behaviours in some cases, this random policy could be replaced by another that takes into account the lower maximum performance.

On the contrary, if the developer has indicated a range of possible problem sizes, the selector module computes the area under the performance curve (that is, its integral) over that range for each available implementation. This information allows the selector to take the implementation that has the smallest area over the range. As shown in Figure 2(b), if the user selects the range 256–512, the selector module will compute the integrals for the five implementations available. Afterwards, it will compare the areas below the curves and take the one with the smallest area, that is, GSL. Note that if there are no performance values at the range boundaries, the values at the boundaries are computed via linear interpolation. As in the previous option using a fixed size, if there are two or more implementations whose integral values are equal, the selector will pick one randomly.
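The two selection rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the selector's actual code: the types Sample and Implementation and all function names are hypothetical, each curve is assumed to hold samples sorted by strictly increasing size, values outside the sampled sizes are clamped to the nearest sample, and ties are resolved by taking the first candidate instead of a random one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One historical measurement: problem size and execution time.
struct Sample { double size; double time; };

// A device-implementation pair with its measured performance curve.
struct Implementation {
    std::string name;
    std::vector<Sample> curve;  // sorted by strictly increasing size
};

// Linear interpolation of the execution time at `size`.
double interpolate(const std::vector<Sample>& curve, double size) {
    assert(!curve.empty());
    if (size <= curve.front().size) return curve.front().time;
    if (size >= curve.back().size) return curve.back().time;
    std::size_t hi = 1;
    while (curve[hi].size < size) ++hi;  // first sample not below `size`
    const Sample& a = curve[hi - 1];
    const Sample& b = curve[hi];
    const double t = (size - a.size) / (b.size - a.size);
    return a.time + t * (b.time - a.time);
}

// Area under the piecewise-linear curve between `lo` and `hi`
// (trapezoidal rule with breakpoints at the samples inside the range).
double area(const std::vector<Sample>& curve, double lo, double hi) {
    assert(lo <= hi);
    std::vector<double> xs;
    xs.push_back(lo);
    for (const Sample& s : curve)
        if (s.size > lo && s.size < hi) xs.push_back(s.size);
    xs.push_back(hi);
    double total = 0.0;
    for (std::size_t i = 1; i < xs.size(); ++i)
        total += 0.5 * (interpolate(curve, xs[i - 1]) + interpolate(curve, xs[i]))
                     * (xs[i] - xs[i - 1]);
    return total;
}

// Fixed size: index of the implementation with the lowest interpolated time.
std::size_t select_fixed(const std::vector<Implementation>& impls, double size) {
    assert(!impls.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < impls.size(); ++i)
        if (interpolate(impls[i].curve, size) < interpolate(impls[best].curve, size))
            best = i;
    return best;
}

// Range of sizes: index of the implementation with the smallest area.
std::size_t select_range(const std::vector<Implementation>& impls,
                         double lo, double hi) {
    assert(!impls.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < impls.size(); ++i)
        if (area(impls[i].curve, lo, hi) < area(impls[best].curve, lo, hi))
            best = i;
    return best;
}
```

Because the curve is treated as piecewise linear, the trapezoidal rule is exact for it once every sample inside the range is used as a breakpoint, so no finer numerical integration is needed in this sketch.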

Figure 2.

Execution time of the dgemm operation for different problem sizes and implementations. Note the region boundaries set by the selector module. (a) Scenario using fixed sizes, (b) Scenario using a range of sizes.

Note that, at present, the selection algorithm only considers the problem size to select the fastest implementation. In the future, we plan to extend the set of user template arguments and attributes to allow users to specify other kinds of restrictions, such as memory usage or energy consumption. This will allow the selector to make multi-objective optimizations.

5. Experimental evaluation

In this section, we evaluate the behaviour of our adaptive offline implementation selector using the dense matrix–matrix multiplication (GEMM) and an image processing application targeted to embedded systems (StereoBM). First, we evaluate the accuracy and convergence of the selector algorithm of the framework. Next, we study the adaptability of the selection framework to changes in heterogeneous platforms. Finally, we compare our solution with the OmpSs versioning runtime scheduler.

The platform used to assess the framework is equipped with an Intel Xeon processor, two AMD Radeon GPUs (connected via PCI Express 3.0) and an Intel Xeon Phi co-processor. Table 1 describes the details of these components. The OS running on this platform is Ubuntu Linux 14.04 and the compiler is GCC v5.0 with the concepts-lite extension enabled.

Table 1.

Heterogeneous test bed platform.

	Xeon	Amd1	Amd2	Mic
Model	Intel Xeon® CPU E5-2695	AMD Radeon™ R9 290X series	AMD Radeon™ R9 285 series	Intel Xeon Phi™ 3120 series
Core clock	2.4 GHz	1030 MHz	928 MHz	1.1 GHz
Computing units	24	44 (2816)	28 (1792)	224
Memory	128 GB	4 GB	2 GB	6 GB
OpenCL version	AMD-APP OpenCL 2.0 1642.5			Intel OpenCL™ Runtime 14.2

Regarding the use cases, for the GEMM, we employ the CPU implementation from the GSL library, while the clBLAS implementation is intended to run on all available devices. Figure 2, from the previous section, shows the execution times for these implementations using square matrix sizes ranging from 128 to 4096. On the other hand, for the StereoBM use case, we use sequential and Intel TBB versions of the OpenCV routines for CPU and accelerators.

5.1. Evaluation of the accuracy and performance

In this section, we evaluate the accuracy and performance of the implementation selector for both use cases. Figure 3 depicts the accuracy of the decisions made by the selector and the performance attained using fixed sizes and ranges of sizes for different numbers of training iterations. In order to train the system, in each iteration, we execute an instance of the dgemm kernel and the StereoBM application with random matrix sizes and image resolutions, respectively. Furthermore, to evaluate the accuracy of each training iteration, we perform 100 runs using random problem sizes. Next, for each of them, we compute the miss rate and the performance rate, which is obtained by dividing the execution time of the fastest implementation by that of the selected implementation. Note as well that calls to the dgemm kernel and OpenCV routines in the StereoBM application have been provided with information about the problem size or range of sizes and processed prior to the compilation phase.
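As an illustration of the two metrics, the sketch below aggregates a batch of evaluation runs into a miss rate and a performance rate, both expressed as percentages. The aggregation is an assumption: the text defines the performance rate per run, and here the per-run values are simply averaged, while a miss is counted whenever the selected implementation is slower than the fastest one. The types Run and Rates and the function evaluate are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Outcome of one evaluation run: time of the fastest available
// implementation and time of the implementation the selector chose.
struct Run {
    double fastest_time;
    double selected_time;
};

// Aggregated metrics, both in percent.
struct Rates {
    double miss_rate;         // share of runs where the choice was not the fastest
    double performance_rate;  // mean of fastest_time / selected_time
};

Rates evaluate(const std::vector<Run>& runs) {
    assert(!runs.empty());
    std::size_t misses = 0;
    double performance_sum = 0.0;
    for (const Run& r : runs) {
        assert(r.fastest_time > 0.0 && r.selected_time >= r.fastest_time);
        if (r.selected_time > r.fastest_time) ++misses;
        performance_sum += r.fastest_time / r.selected_time;
    }
    const double n = static_cast<double>(runs.size());
    return Rates{100.0 * static_cast<double>(misses) / n,
                 100.0 * performance_sum / n};
}
```

With this definition, a selector that always picks the fastest implementation obtains a 0% miss rate and a 100% performance rate, and a miss that costs little extra time lowers the performance rate only slightly.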

Figure 3.

Progress of the miss and performance rates of the selector for the GEMM and STEREOBM. (a) Miss and performance rates for GEMM, (b) Miss and performance rates for STEREOBM.

As can be seen in Figure 3(a), the miss rate for the GEMM case using both fixed sizes and ranges of sizes decreases smoothly until reaching, after 500 training iterations, 2% and 5%, respectively. Similarly, the performance rate steadily increases up to 100% in both cases, with fixed sizes and ranges of sizes. This is mainly because the selector has by then gained enough knowledge about the performance of the different implementations used. Therefore, all selections made from that point on will provide good performance figures. Similarly, the miss rate for the StereoBM application, depicted in Figure 3(b), decreases more rapidly than for the GEMM case, as it has fewer device-implementation combinations available. This behaviour finally leads to low miss rates: 8% and 3% for fixed sizes and ranges of sizes after 500 iterations, respectively. In the same way, the performance rates at the 500th iteration are close to 100% for both fixed sizes and ranges of sizes. In general, we observe that the miss rate for fixed problem sizes is slightly lower than for ranges of sizes. This is because the size parameter provided by the user is a more accurate indicator than a size freely selected within a given range. In addition, miss rates are mostly inversely proportional to performance rates; therefore, the low miss rates observed are indicators of good performance figures.

5.2. Evaluation of the adaptability

Next, we evaluate the adaptability of the selector, that is, its ability to make appropriate decisions when a new device (and implementation) is attached to the heterogeneous platform every 100 training iterations. Figure 4 analyses the quality of the selections in this scenario for both the GEMM and StereoBM use cases using ranges of sizes. Focusing on the GEMM case, the selector reaches about 98% accuracy after 100 iterations. When a new device implementation is added, we observe that the selector needs only 20 additional iterations to converge once again and reach accuracy figures similar to those before the change. Looking at the StereoBM application, we notice that when we add a new version, the accuracy drops considerably compared to the GEMM case; however, the selector needs barely 25 iterations to stabilize once again. As a final remark, we observe that the accuracy reached by the selector when new devices are steadily attached to the platform is slightly higher than if all devices are attached right at the beginning of the training process. This is because the selector needs less training if only one additional device is attached each time. All in all, the low numbers of additional training iterations demonstrate the ability of the framework to react to platform changes.

Figure 4.

Progress of the accuracy of the selector for the GEMM and STEREOBM when adding new devices. (a) Accuracy using ranges of sizes for GEMM, (b) Accuracy using ranges of sizes for STEREOBM.

5.3. Comparison with alternative approaches

As stated in Section 2, several static and dynamic approaches that allow selecting among different implementations of the same algorithm can be found in the literature. In this section, we validate the performance benefits of our adaptive offline implementation selector against a real runtime scheduler. Concretely, we compare our approach with the versioning runtime scheduler from the OmpSs programming model,⁹ as it offers an implementation selector similar to our static solution.

To compare our solution (AIS) with OmpSs, we developed an application composed of two 50-iteration loops that compute the matrix–matrix product (dgemm kernel) using square matrices of sizes 256 and 2048, respectively. Note that the multiplication is performed using the same dgemm implementations as in the previous experiments. For AIS, we annotate the dgemm kernel calls using the attribute ais::size for both matrix sizes, while in OmpSs, we define different tasks for the available implementations that are annotated with the implements and target directives.

Figure 5 depicts the execution progress of this application. As observed, AIS selects the implementations that perform best for the different matrix sizes from the first loop iteration. Note that AIS has been previously trained by performing 100 executions of the dgemm kernel with random matrix sizes, and the measured profiling overhead is not higher than 1%. On the contrary, OmpSs cannot be trained offline, so it makes a few trial runs of the different implementations until it finds, at runtime, the fastest one. In these cases, the single training phase of AIS pays off compared with the OmpSs trial runs and the runtime scheduler overhead. Specifically, the measured OmpSs overhead ranges between 2% and 40% for the large and small matrix sizes, respectively. In general, AIS offers a static implementation selector that is able to train across executions and adapts at compile time, whereas OmpSs offers a more flexible alternative that requires a runtime scheduler to make decisions, at the expense of non-negligible overheads.

Figure 5.

Execution progress of two 50-iteration loops computing the dgemm using AIS and OmpSs.

6. Conclusions

In this article, we have presented an adaptive offline implementation selector for heterogeneous platforms that automatically selects the device-implementation tuple delivering the best performance according to the problem size and historical information. Thanks to this approach, our framework shifts the decision-making process to compile time, so that the overheads of dynamic scheduling approaches are replaced by a negligible profiling process. Furthermore, our framework is hardware independent; therefore, it is possible to freely add devices to or remove devices from the platform without incurring significant overheads. To enable portability, our implementation selector leverages two novel C++ features (attributes and concepts), inherent to the standard C++ programming language.

To evaluate the benefits of this framework, we analysed the performance and accuracy of the tool using two different use cases: the GEMM and an image processing application. The experimental results show that the selector enhances performance while minimizing the effort needed to tune applications targeted to heterogeneous platforms. Furthermore, the results also show that our framework delivers performance comparable to that of the OmpSs versioning runtime scheduler.

As future work, we plan to extend the set of C++ attributes in order to allow users to specify other kinds of restrictions, such as memory usage or energy consumption. Furthermore, we also aim to incorporate a static partitioning module for supporting multiple devices in shared and distributed memory systems. Another goal is to integrate the attribute-based selector tool as part of the Clang C++ compiler.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work has been partially supported by the Spanish ‘Ministerio de Economía y Competitividad’ under the project grant TIN2016-79637-P ‘Towards Unification of High Performance Computing (HPC) and Big Data Paradigms’ and the EU Projects ICT 644235 ‘RePhrase: REfactoring Parallel Heterogeneous Resource-Aware Applications’ and the FP7 609666 ‘Repara: Reengineering and Enabling Performance And poweR of Applications’.

References

1. Kindratenko, Trancoso (2011) Trends in high-performance computing. Computing in Science & Engineering 13(3): 92–95.

2. Sanchez, Fernandez, Sotomayor, et al. (2013) A comparative study and evaluation of parallel programming models for shared-memory parallel architectures. New Generation Computing 31(3): 139–161.

3. Daoud, Kharma (2006) Efficient compile-time task scheduling for heterogeneous distributed computing systems. In: Silvia (ed.) Parallel and Distributed Systems, 2006. ICPADS 2006. 12th International Conference on, Minneapolis, Minnesota, USA, 12–15 July 2006, Vol 1, p. 9. IEEE Computer Society. DOI: 10.1109/ICPADS.2006.40.

4. nVidia (2012) cuBLAS Library User Guide. v5.0 edition. Godalming, Surrey, UK: nVidia.

5. Gough (2009) GNU Scientific Library Reference Manual - Third Edition. 3rd ed. Los Alamitos, CA, USA: Network Theory Ltd. ISBN 0954612078.

6. Wang, Grewe, O’Boyle MFP (2014) Automatic and portable mapping of data parallel programs to openCL for GPU-based heterogeneous systems. ACM Trans Archit Code Optim 11(4): 42: 1–42.

7. Tan, Tang, Goh, et al. (2015) A code generation framework for targeting optimized library calls for multiple platforms. Parallel and Distributed Systems, IEEE Transactions on 26(7): 1789–1799.

8. Shen, Varbanescu, Sips (2014) Look before you leap: using the right hardware resources to accelerate applications. In: High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS), 2014 IEEE Intl Conf on. pp. 383–391.

9. Duran, Ayguadé, Badia, et al. (2011) Ompss: a proposal for programming heterogeneous multi-core architectures. Parallel Processing Letters 21: 173–193. DOI: http://dx.doi.org/10.1142/S0129626411000151.

10. Dastgeer, Kessler (2013) Adaptive Implementation Selection in the SkePU Skeleton Programming Library. In: Wu, Cohen (eds) Advanced Parallel Processing Technologies: 10th International Symposium, APPT 2013, Stockholm, Sweden, 27–28 August 2013, (Revised Selected Papers). pp. 170–183. Berlin, Heidelberg: Springer Berlin Heidelberg.

11. Maurer, Wong (2008) Towards support for attributes in C++ (Revision 6). In: JTC1/SC22/WG21 - The C++ Standards Committee. N2761=08-0271.

12. Sutton (2014) Working Draft, C++ Extensions for Concepts. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4040.pdf (accessed 1 September 2016).

13. Sutton, Stroustrup (2012) Concepts lite: constraining templates with predicates. http://concepts.axiomatics.org/ans/concepts-lite.pdf (accessed 5 September 2016).

14. REPARA project (2014) Target Platform Description Specification. https://repara-project.eu/wp-content/uploads/2014/04/ICT-609666-D3.1.pdf (accessed 1 October 2016).

15. Sanchez, del Rio Astorga, Dolz, et al. (2016) CID: a compile-time implementation decider for heterogeneous platforms based on C++ attributes. In: O’Conner (ed.) 2016 Intl IEEE Scalable Computing and Communications, Toulouse, 18–21 July 2016, pp. 1149–1156. IEEE Computer Society.