Developing an OP2 Application

This page provides a tutorial on the basics of using OP2 for unstructured-mesh application development.

Example Application

The tutorial will use the Airfoil application, a simple non-linear 2D inviscid airfoil code that uses an unstructured mesh. It is a finite-volume application that solves the 2D Euler equations using scalar numerical dissipation. The algorithm iterates towards the steady-state solution; each iteration uses a control-volume approach: for example, the rate at which the mass changes within a control volume is equal to the net flux of mass into the control volume across the four faces around the cell.

Airfoil consists of five loops, save_soln, adt_calc, res_calc, bres_calc and update, within a time-marching iterative loop. Of these, save_soln and update are what we classify as direct loops, where all the data accessed in the loop is defined on the mesh element type over which the loop iterates. For example, in a direct loop over edges only data defined on edges is accessed. The other three loops are indirect loops: when looping over a given type of element, data on other types of elements is accessed indirectly, using mapping tables. For example, res_calc iterates over edges and increments data on cells, accessing them indirectly via a mapping table that gives the explicit connectivity between edges and cells.
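
As a minimal illustration of the two loop types (a sketch only, using the Airfoil array names but a placeholder edge computation rather than the real kernels):

// Direct loop over cells: only data defined on cells (q, qold) is accessed.
for (int c = 0; c < ncell; c++)
  for (int n = 0; n < 4; n++)
    qold[4 * c + n] = q[4 * c + n];

// Indirect loop over edges: data defined on cells (res) is accessed through
// the edge-to-cell mapping table ecell.
for (int e = 0; e < nedge; e++) {
  int c0 = ecell[2 * e + 0];
  int c1 = ecell[2 * e + 1];
  double flux = 0.5 * (q[4 * c0 + 0] - q[4 * c1 + 0]);  // placeholder flux
  res[4 * c0 + 0] += flux;
  res[4 * c1 + 0] -= flux;
}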

The standard mesh size solved with Airfoil consists of 1.5M edges. Here the most compute intensive loop is res_calc, which is called 2000 times during the total execution of the application and performs about 100 floating-point operations per mesh edge.

  • Go to the OP2/apps/c/airfoil/airfoil/airfoil_tutorial/original directory and open the airfoil_orig.cpp file to view the original application.

  • Use the Makefile in the same directory to compile and then run the application. The new_grid.dat (downloadable from here) needs to be present in the same directory as the executable.

  • The program reports the rms value of the pressure held on the cells every 100 iterations and ends by comparing this value to the reference value of the computation at 1000 iterations. If the solution is within machine precision of the reference value, the execution is considered to be validated (PASS). The same criterion is used to validate any parallel version of the application, including the parallelizations generated by OP2 as detailed below.

Original - Load mesh and initialization

The original code begins by allocating memory to hold the mesh data and then initializing it by reading the mesh from the new_grid.dat text file:

FILE *fp;
if ((fp = fopen(FILE_NAME_PATH, "r")) == NULL) {
  printf("can't open file FILE_NAME_PATH\n");
  exit(-1);
}
if (fscanf(fp, "%d %d %d %d \n", &nnode, &ncell, &nedge, &nbedge) != 4) {
  printf("error reading from FILE_NAME_PATH\n");
  exit(-1);
}

cell   = (int *)malloc(4 * ncell  * sizeof(int));
edge   = (int *)malloc(2 * nedge  * sizeof(int));
ecell  = (int *)malloc(2 * nedge  * sizeof(int));
bedge  = (int *)malloc(2 * nbedge * sizeof(int));
becell = (int *)malloc(1 * nbedge * sizeof(int));
bound  = (int *)malloc(1 * nbedge * sizeof(int));
x      = (double *)malloc(2 * nnode * sizeof(double));
q      = (double *)malloc(4 * ncell * sizeof(double));
qold   = (double *)malloc(4 * ncell * sizeof(double));
res    = (double *)malloc(4 * ncell * sizeof(double));
adt    = (double *)malloc(1 * ncell * sizeof(double));

for (int n = 0; n < nnode; n++) {
  if (fscanf(fp, "%lf %lf \n", &x[2 * n], &x[2 * n + 1]) != 2) {
    printf("error reading from FILE_NAME_PATH\n");
    exit(-1);
  }
}

for (int n = 0; n < ncell; n++) {
  if (fscanf(fp, "%d %d %d %d \n", &cell[4 * n], &cell[4 * n + 1],
      &cell[4 * n + 2], &cell[4 * n + 3]) != 4) {
    printf("error reading from FILE_NAME_PATH\n");
    exit(-1);
  }
}
for (int n = 0; n < nedge; n++) {
  if (fscanf(fp, "%d %d %d %d \n", &edge[2 * n], &edge[2 * n + 1],
      &ecell[2 * n], &ecell[2 * n + 1]) != 4) {
    printf("error reading from FILE_NAME_PATH\n");
    exit(-1);
  }
}
for (int n = 0; n < nbedge; n++) {
  if (fscanf(fp, "%d %d %d %d \n", &bedge[2 * n], &bedge[2 * n + 1],
      &becell[n], &bound[n]) != 4) {
    printf("error reading from FILE_NAME_PATH\n");
    exit(-1);
  }
}
fclose(fp);

The code then initializes the q and res data arrays to 0.
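
A sketch of that zero-initialization (assuming the four-values-per-cell layout used above):

// zero the solution and flux residual arrays (4 doubles per cell)
for (int n = 0; n < ncell; n++) {
  for (int m = 0; m < 4; m++) {
    q[4 * n + m]   = 0.0;
    res[4 * n + m] = 0.0;
  }
}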

Original - Main iteration and loops over mesh

The main iterative loop is a for loop that runs for NUM_ITERATIONS, which in this case is set to 1000 iterations. Within this main iterative loop there are five loops over various mesh elements (as noted above), including both direct and indirect loops.
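
Schematically, the structure is as follows (a sketch with the loop bodies omitted; the two inner passes are what make res_calc run 2000 times over 1000 iterations):

//main time-marching loop
for (int iter = 1; iter <= NUM_ITERATIONS; iter++) {
  // save_soln : save q to qold           (direct,   over cells)
  for (int k = 0; k < 2; k++) {
    // adt_calc : area/timestep           (indirect, over cells)
    // res_calc : flux residual           (indirect, over edges)
    // bres_calc: boundary flux residual  (indirect, over boundary edges)
    // update   : flow field update + rms (direct,   over cells)
  }
  // every 100 iterations: print the rms value
}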

Build OP2

Build OP2 using the instructions in the Getting Started page.

Step 1 - Preparing to use OP2

First, include the op_seq.h header file, then initialize OP2 at the start of the program and finalize it at the end, as follows:

#include "op_seq.h"
...
...
int main(int argc, char **argv) {
  //Initialise the OP2 library, passing runtime args, and setting diagnostics level to low (1)
  op_init(argc, argv, 1);
  ...
  ...
  ...
  free(adt);
  free(res);

  //Finalising the OP2 library
  op_exit();
}

By this point you need OP2 set up. Take a look at the Makefile in step1 and observe that the OP2 include and library paths are added and that we link against the op2_seq back-end library. Using the op_seq.h header file and linking with the sequential back-end OP2 library produces what we call a developer sequential version of the application. This version should be used to successively develop the rest of the application using the OP2 API and to validate its numerical output, as we do in the next steps.

Step 2 - OP2 Declaration

Declare sets - The Airfoil application consists of four mesh element types (which we call sets): nodes, edges, cells and boundary edges. These need to be declared using the op_decl_set API call, together with the number of elements in each set:

// declare sets
op_set nodes  = op_decl_set(nnode,  "nodes" );
op_set edges  = op_decl_set(nedge,  "edges" );
op_set bedges = op_decl_set(nbedge, "bedges");
op_set cells  = op_decl_set(ncell,  "cells" );

Later, we will see how the number of mesh elements can be read directly from an HDF5 file using the op_decl_set_hdf5 call.

When developing your own application with OP2, or indeed converting an existing application to use OP2, you will need to decide which mesh element types, i.e. sets, need to be declared to define the full mesh. A good starting point for this design is to see which mesh elements are used in the loops over the mesh.

Declare maps - Looking at the original Airfoil application’s loops we see that mappings between edges and nodes, edges and cells, boundary edges and nodes, boundary edges and cells, and cells and nodes are required. This can be observed from the indirect accesses to data in each of the loops within the main iteration loop. This connectivity information needs to be declared via the op_decl_map API call:

//declare maps
op_map pedge   = op_decl_map(edges,  nodes, 2, edge,   "pedge"  );
op_map pecell  = op_decl_map(edges,  cells, 2, ecell,  "pecell" );
op_map pbedge  = op_decl_map(bedges, nodes, 2, bedge,  "pbedge" );
op_map pbecell = op_decl_map(bedges, cells, 1, becell, "pbecell");
op_map pcell   = op_decl_map(cells,  nodes, 4, cell,   "pcell"  );

op_decl_map requires the two sets between which the mapping is declared, its arity (the number of to-set elements associated with each from-set element), the mapping data (in this case flat blocks of integers) and a string name.
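
For example, the flat integer arrays read in earlier are interpreted as follows (a sketch; n is any valid element index):

// pedge (arity 2): the two nodes of edge n
int node0 = edge[2 * n + 0];
int node1 = edge[2 * n + 1];

// pcell (arity 4): the four nodes of cell n
int cnode0 = cell[4 * n + 0];   // ... up to cell[4 * n + 3]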

Declare data - All data declared on sets should be declared using the op_decl_dat API call. For Airfoil this consists of the node coordinates x, the new and old solutions q and qold, the area/timestep adt, the flux residual res and the boundary flag bound defined on the boundary edges:

//declare data on sets
op_dat p_bound = op_decl_dat(bedges, 1, "int",    bound, "p_bound");
op_dat p_x     = op_decl_dat(nodes,  2, "double", x,     "p_x"    );
op_dat p_q     = op_decl_dat(cells,  4, "double", q,     "p_q"    );
op_dat p_qold  = op_decl_dat(cells,  4, "double", qold,  "p_qold" );
op_dat p_adt   = op_decl_dat(cells,  1, "double", adt,   "p_adt"  );
op_dat p_res   = op_decl_dat(cells,  4, "double", res,   "p_res"  );

Declare constants - Finally, global constants that are used in any of the computations in the loops need to be declared. This is required because, when code generation is later used for parallelizations such as GPUs (e.g. using CUDA), global constants need to be copied over to the GPU before they can be used in a GPU kernel. Declaring them with the op_decl_const API call indicates to the OP2 code generator that these constants need to be handled in a special way, generating the code that copies them to the GPU for the relevant back-ends.

//declare global constants
op_decl_const(1, "double", &gam  );
op_decl_const(1, "double", &gm1  );
op_decl_const(1, "double", &cfl  );
op_decl_const(1, "double", &eps  );
op_decl_const(1, "double", &alpha);
op_decl_const(4, "double", qinf  );

Finally, information about the declared mesh can be viewed by setting the diagnostics level to 2 in op_init and calling the op_diagnostic_output() API call:

/* main application */
int main(int argc, char **argv) {
  //Initialise the OP2 library, passing runtime args, and setting diagnostics level to 2
  op_init(argc, argv, 2);
  ...
  ... op_decl_set ...
  ... op_decl_map ...
  ... op_decl_dat ...
  ... op_decl_const ...
  ...
  //output mesh information
  op_diagnostic_output();

Now compile the step2 application and execute it using the following runtime flag:

./airfoil_step2 OP_NO_REALLOC

The OP_NO_REALLOC runtime flag instructs the OP2 back-end to simply use the memory already allocated for the sets, maps and data in the declarations, without internally de-allocating it. This helps the developer to gradually build up the application with the conversion to the OP2 API (as we do here), checking for validation at each step. However, this behavior is only supported by the developer sequential version, which we are developing here. Neither the parallel versions generated via the code generator nor the code-generated sequential version gen_seq will work with this flag, as they de-allocate the initial memory and move the mesh data to obtain the best parallel performance.

Step 3 - First parallel loop : direct loop

We can now convert the first loop to use the OP2 API. In this case it is a direct loop called save_soln that iterates over cells and saves the previous time-iteration’s solution q to qold:

//save_soln : iterates over cells
for (int iteration = 0; iteration < (ncell * 4); ++iteration) {
  qold[iteration] = q[iteration];
}

This is a direct loop because all the data accessed in the computation is defined on the set over which the loop iterates; in this case the iteration set is cells.

To convert to the OP2 API we first outline the loop body (elemental kernel) to a subroutine:

//outlined elemental kernel
inline void save_soln(double *q, double *qold) {
  for (int n = 0; n < 4; n++)
    qold[n] = q[n];
}

//save_soln : iterates over cells
for (int iteration = 0; iteration < ncell; ++iteration) {
  save_soln(&q[4 * iteration], &qold[4 * iteration]);
}

Now we can directly declare the loop with the op_par_loop API call:

op_par_loop(save_soln, "save_soln", cells,
            op_arg_dat(p_q,    -1, OP_ID, 4, "double", OP_READ ),
            op_arg_dat(p_qold, -1, OP_ID, 4, "double", OP_WRITE));

Note how we have:

  • indicated the elemental kernel save_soln in the first argument to op_par_loop

  • used the op_dat names p_q and p_qold in the API call

  • noted the iteration set cells (3rd argument)

  • indicated the direct access of p_q and p_qold using OP_ID

  • indicated that p_q is read only (OP_READ) and p_qold is written to only (OP_WRITE), by looking through the elemental kernel and identifying how each is used/accessed in the kernel.

  • given that p_q is read only, we also indicate this with the const keyword on the corresponding argument of the save_soln elemental kernel, as shown below.

  • The fourth argument of an op_arg_dat is the dimension of the data; for p_q and p_qold there are 4 doubles per cell.
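
Reflecting the point about const above, the save_soln elemental kernel now marks its read-only argument accordingly:

//outlined elemental kernel with the read-only argument const-qualified
inline void save_soln(const double *q, double *qold) {
  for (int n = 0; n < 4; n++)
    qold[n] = q[n];
}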

Compile and execute (again using OP_NO_REALLOC) the modified application (see code in ../step3) and check if the solution validates.

Step 4 - Indirect loops

The next loop in the application, adt_calc, calculates the area/timestep and iterates over cells. In this case we see that the loop is an indirect loop, where the data x on the four nodes connected to a cell is accessed indirectly via a cells-to-nodes mapping. Additionally, the data q and adt are accessed directly, as both are defined on the cells:

//adt_calc - calculate area/timestep : iterates over cells
for (int iteration = 0; iteration < ncell; ++iteration) {
  int map1idx = cell[iteration * 4 + 0];
  int map2idx = cell[iteration * 4 + 1];
  int map3idx = cell[iteration * 4 + 2];
  int map4idx = cell[iteration * 4 + 3];

  double dx, dy, ri, u, v, c;

  ri = 1.0f / q[4 * iteration + 0];
  u = ri * q[4 * iteration + 1];
  v = ri * q[4 * iteration + 2];
  c = sqrt(gam * gm1 * (ri * q[4 * iteration + 3] - 0.5f * (u * u + v * v)));

  dx = x[2 * map2idx + 0] - x[2 * map1idx + 0];
  dy = x[2 * map2idx + 1] - x[2 * map1idx + 1];
  adt[iteration] = fabs(u * dy - v * dx) + c * sqrt(dx * dx + dy * dy);

  dx = x[2 * map3idx + 0] - x[2 * map2idx + 0];
  dy = x[2 * map3idx + 1] - x[2 * map2idx + 1];
  adt[iteration] += fabs(u * dy - v * dx) + c * sqrt(dx * dx + dy * dy);

  dx = x[2 * map4idx + 0] - x[2 * map3idx + 0];
  dy = x[2 * map4idx + 1] - x[2 * map3idx + 1];
  adt[iteration] += fabs(u * dy - v * dx) + c * sqrt(dx * dx + dy * dy);

  dx = x[2 * map1idx + 0] - x[2 * map4idx + 0];
  dy = x[2 * map1idx + 1] - x[2 * map4idx + 1];
  adt[iteration] += fabs(u * dy - v * dx) + c * sqrt(dx * dx + dy * dy);

  adt[iteration] = (adt[iteration]) / cfl;
}

Similar to the direct loop, we outline the loop body and call it within the loop as follows:

//outlined elemental kernel - adt_calc
inline void adt_calc(double *x1, double *x2, double *x3,
                     double *x4, double *q, double *adt) {
  double dx, dy, ri, u, v, c;

  ri = 1.0f / q[0];
  u = ri * q[1];
  v = ri * q[2];
  c = sqrt(gam * gm1 * (ri * q[3] - 0.5f * (u * u + v * v)));

  dx = x2[0] - x1[0];
  dy = x2[1] - x1[1];
  *adt = fabs(u * dy - v * dx) + c * sqrt(dx * dx + dy * dy);

  dx = x3[0] - x2[0];
  dy = x3[1] - x2[1];
  *adt += fabs(u * dy - v * dx) + c * sqrt(dx * dx + dy * dy);

  dx = x4[0] - x3[0];
  dy = x4[1] - x3[1];
  *adt += fabs(u * dy - v * dx) + c * sqrt(dx * dx + dy * dy);

  dx = x1[0] - x4[0];
  dy = x1[1] - x4[1];
  *adt += fabs(u * dy - v * dx) + c * sqrt(dx * dx + dy * dy);

  *adt = (*adt) / cfl;
}

//adt_calc - calculate area/timestep : iterates over cells
for (int iteration = 0; iteration < ncell; ++iteration) {
  int map1idx = cell[iteration * 4 + 0];
  int map2idx = cell[iteration * 4 + 1];
  int map3idx = cell[iteration * 4 + 2];
  int map4idx = cell[iteration * 4 + 3];

  adt_calc(&x[2 * map1idx], &x[2 * map2idx], &x[2 * map3idx],
           &x[2 * map4idx], &q[4 * iteration], &adt[iteration]);
}

Now, convert the loop to use the op_par_loop API:

//adt_calc - calculate area/timestep : iterates over cells
op_par_loop(adt_calc, "adt_calc", cells,
            op_arg_dat(p_x,   0, pcell, 2, "double", OP_READ ),
            op_arg_dat(p_x,   1, pcell, 2, "double", OP_READ ),
            op_arg_dat(p_x,   2, pcell, 2, "double", OP_READ ),
            op_arg_dat(p_x,   3, pcell, 2, "double", OP_READ ),
            op_arg_dat(p_q,  -1, OP_ID, 4, "double", OP_READ ),
            op_arg_dat(p_adt,-1, OP_ID, 1, "double", OP_WRITE));

Note in this case how the indirect accesses are specified using the mapping declared as the OP2 map pcell, together with the index into the mapping (2nd argument, selecting which of the cell’s four nodes is used) and the access mode OP_READ.
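
Conceptually, under the developer sequential back-end each such argument resolves to the same pointer arithmetic as the outlined loop above (a sketch for cell i):

// op_arg_dat(p_x, 2, pcell, 2, "double", OP_READ) supplies, for cell i,
// a read-only pointer to the 2 coordinates of the cell's third node:
double *x3 = &x[2 * cell[4 * i + 2]];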

Step 5 - Global reductions

At this stage almost all of the remaining loops can be converted to the OP2 API in the same way. Only the final loop, update, which updates the flow field, needs special handling due to its global reduction:

rms = 0.0f;
for (int iteration = 0; iteration < ncell; ++iteration) {
  double del, adti;

  adti = 1.0f / (adt[iteration]);

  for (int n = 0; n < 4; n++) {
    del = adti * res[iteration * 4 + n];
    q[iteration * 4 + n] = qold[iteration * 4 + n] - del;
    res[iteration * 4 + n] = 0.0f;
    rms += del * del;
  }
}

Here, the global variable rms is used as a reduction variable to compute the rms value of the residual. The kernel can be outlined as follows:

//outlined elemental kernel - update
inline void update(const double *qold, double *q, double *res,
                   const double *adt, double *rms) {
  double del, adti;

  adti = 1.0f / (*adt);

  for (int n = 0; n < 4; n++) {
    del = adti * res[n];
    q[n] = qold[n] - del;
    res[n] = 0.0f;
    *rms += del * del;
  }
}

And then call it in the application:

//update = update flow field - iterates over cells
rms = 0.0f;
for (int iteration = 0; iteration < ncell; ++iteration) {
  update(&qold[iteration * 4], &q[iteration * 4], &res[iteration * 4],
         &adt[iteration], &rms);
}

The global reduction requires the op_arg_gbl API call with OP_INC access mode to indicate that the variable is a global reduction:

//update = update flow field - iterates over cells
rms = 0.0f;
op_par_loop(update, "update", cells,
            op_arg_dat(p_qold, -1, OP_ID, 4, "double", OP_READ ),
            op_arg_dat(p_q,    -1, OP_ID, 4, "double", OP_WRITE),
            op_arg_dat(p_res,  -1, OP_ID, 4, "double", OP_RW   ),
            op_arg_dat(p_adt,  -1, OP_ID, 1, "double", OP_READ ),
            op_arg_gbl(&rms,    1,           "double", OP_INC  ));

At this point all the loops have been converted to use the op_par_loop API and the application should validate when executed as a sequential, single-threaded CPU application. You should now also be able to run without the OP_NO_REALLOC runtime flag and still get a valid result. However, in this case note that OP2 will be making internal copies of the data declared for each op_map and op_dat. When developing applications for performance, you should consider freeing the initially allocated memory immediately after the relevant op_decl_map and op_decl_dat calls (see the sketch below). In the next step we avoid freeing such “application developer allocated” memory altogether by using HDF5 file I/O, so that mesh data is read from file directly into OP2 allocated internal memory.
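
A sketch of freeing the application’s own allocations once OP2 has made its internal copies (i.e. when not running with OP_NO_REALLOC):

//declare map and data, then release the application's own copies
op_map pcell = op_decl_map(cells, nodes, 4, cell, "pcell");
free(cell);

op_dat p_x = op_decl_dat(nodes, 2, "double", x, "p_x");
free(x);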

Step 6 - Handing it all to OP2

Once the developer sequential version has been created and the numerical output validates, the application can be prepared to obtain a developer distributed-memory parallel version. This step can be completed to obtain a parallel executable, without code generation, if the following steps are implemented.

  1. File I/O needs to be extended to allow distributed-memory execution with MPI. The current Airfoil application simply reads the mesh data from a text file; such a simple setup will not work on a distributed-memory system such as a cluster and, more importantly, will not scale with MPI. The simplest solution is to use OP2’s HDF5 API for declaring the mesh, replacing op_decl_set, op_decl_map, op_decl_dat and op_decl_const with their HDF5 counterparts as follows:

// declare sets
op_set nodes  = op_decl_set_hdf5(file,  "nodes" );
op_set edges  = op_decl_set_hdf5(file,  "edges" );
op_set bedges = op_decl_set_hdf5(file, "bedges");
op_set cells  = op_decl_set_hdf5(file,  "cells" );

//declare maps
op_map pedge   = op_decl_map_hdf5(edges,  nodes, 2, file, "pedge"  );
op_map pecell  = op_decl_map_hdf5(edges,  cells, 2, file, "pecell" );
op_map pbedge  = op_decl_map_hdf5(bedges, nodes, 2, file, "pbedge" );
op_map pbecell = op_decl_map_hdf5(bedges, cells, 1, file, "pbecell");
op_map pcell   = op_decl_map_hdf5(cells,  nodes, 4, file, "pcell"  );

//declare data on sets
op_dat p_bound = op_decl_dat_hdf5(bedges, 1, "int",    file, "p_bound");
op_dat p_x     = op_decl_dat_hdf5(nodes,  2, "double", file, "p_x"    );
op_dat p_q     = op_decl_dat_hdf5(cells,  4, "double", file, "p_q"    );
op_dat p_qold  = op_decl_dat_hdf5(cells,  4, "double", file, "p_qold" );
op_dat p_adt   = op_decl_dat_hdf5(cells,  1, "double", file, "p_adt"  );
op_dat p_res   = op_decl_dat_hdf5(cells,  4, "double", file, "p_res"  );

//read and declare global constants
op_get_const_hdf5("gam",   1, "double", (char *)&gam,  file);
op_get_const_hdf5("gm1",   1, "double", (char *)&gm1,  file);
op_get_const_hdf5("cfl",   1, "double", (char *)&cfl,  file);
op_get_const_hdf5("eps",   1, "double", (char *)&eps,  file);
op_get_const_hdf5("alpha", 1, "double", (char *)&alpha,file);
op_get_const_hdf5("qinf",  4, "double", (char *)&qinf, file);

op_decl_const(1, "double", &gam  );
op_decl_const(1, "double", &gm1  );
op_decl_const(1, "double", &cfl  );
op_decl_const(1, "double", &eps  );
op_decl_const(1, "double", &alpha);
op_decl_const(4, "double", qinf  );

Note here that we assume that the mesh is already available as an HDF5 file named new_grid.h5. (See the convert_mesh.cpp utility application in OP2-Common/apps/c/airfoil/airfoil_hdf5/dp to understand how we can create an HDF5 file to be compatible with the OP2 API for Airfoil starting from mesh data defined in a text file.)

When the application has been switched to use the HDF5 API calls, the manually allocated memory for the mesh elements can be removed. Additionally, all printf statements should use op_printf so that output to the terminal is only done by the ROOT MPI process. We can also replace the timer routines with OP2’s op_timers, which times the execution of the code on the ROOT process.
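
A sketch of how op_timers and op_printf are typically used to time and report the main iteration loop (the variable names are illustrative):

double cpu_t1, wall_t1, cpu_t2, wall_t2;
op_timers(&cpu_t1, &wall_t1);           // start timing

// ... main time-marching loop with op_par_loop calls ...

op_timers(&cpu_t2, &wall_t2);           // stop timing
op_printf("Total runtime = %f seconds\n", wall_t2 - wall_t1);  // printed by ROOT only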

Given that the mesh was read via HDF5, OP2’s op_get_size() API call needs to be used to obtain the global sizes of the mesh. In Airfoil this is required to obtain the total number of cells, used when computing the rms value every 100 iterations to validate the application:

//get global number of cells
ncell = op_get_size(cells);

  2. Add the OP2 partitioner call op_partition to the code, in order to signal to the MPI back-end the point in the program at which all mesh data has been defined and the mesh can be partitioned and MPI halos created:

...
...
op_decl_const(1, "double", &alpha);
op_decl_const(4, "double", qinf  );

//output mesh information
op_diagnostic_output();

//partition mesh and create mpi halos
op_partition("BLOCK", "ANY", edges, pecell, p_x);

...
...

See the API documentation for the available partitioner options. In this case no special partitioner is used, leaving the initial block partitioning of the data performed at the time of file I/O through HDF5.
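
For instance, if OP2 was built with PT-Scotch support, a k-way partitioner could be requested instead (a sketch; check the API documentation for the exact strings supported by your build):

//partition the mesh with PT-Scotch k-way partitioning
op_partition("PTSCOTCH", "KWAY", edges, pecell, p_x);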

Take a look at the code in the /step6 directory for the full set of code changes made to the Airfoil application. The application can now be compiled to obtain a developer distributed-memory (MPI) parallel executable using the Makefile in the same directory. Note how the executable is created by linking with the OP2 MPI back-end library libop2_mpi together with the HDF5 library libhdf5. You will need to have the HDF5 library installed on your system to carry out this step.

The resulting executable is called a developer MPI version of the application, which should again be used to verify that the application validates, running it with mpirun in the usual way of executing an MPI application.

Step 7 - Code generation

Now that both the sequential and MPI developer versions work and validate, it is time to generate the other parallel versions. However, first we should move the elemental kernels to header files so that, after code generation, the modified main application does not contain duplicate elemental kernel definitions. This is currently a limitation of the code generator, which will be remedied in future versions.

We move the elemental kernels to header files, each named after its kernel, and include them in the airfoil_step7.cpp main file:

...
...
/* Global Constants */
double gam, gm1, cfl, eps, mach, alpha, qinf[4];

//
// kernel routines for parallel loops
//
#include "adt_calc.h"
#include "bres_calc.h"
#include "res_calc.h"
#include "save_soln.h"
#include "update.h"
...
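
For example, save_soln.h then contains just the elemental kernel outlined in Step 3:

//save_soln.h - elemental kernel, one header file per kernel
inline void save_soln(const double *q, double *qold) {
  for (int n = 0; n < 4; n++)
    qold[n] = q[n];
}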

Now that the code is ready for code generation, go to the /step7 directory and type the following in the terminal:

python $OP2_INSTALL_PATH/../translator/c/op2.py airfoil_step7.cpp

Note that the python command is assumed to point to Python 3 or above. This will generate (in the same directory) parallel code under the sub-directories cuda, openacc, openmp, openmp4, seq and vec. These correspond to on-node parallel versions based on CUDA, OpenACC, OpenMP, OpenMP 4.0 (and higher), a code-generated sequential version and SIMD vectorization, respectively. Each of them can also be run in combination with distributed-memory parallelization using MPI across nodes.

The Makefile in /step7 simply uses the OP2 supplied makefiles:

APP_NAME := airfoil_step7

APP_ENTRY := $(APP_NAME).cpp
APP_ENTRY_MPI := $(APP_ENTRY)

OP2_LIBS_WITH_HDF5 := true

include ../../../../../makefiles/common.mk
include ../../../../../makefiles/c_app.mk

This will build all the currently supported parallel versions, provided that the supporting libraries are available on your system and the relevant OP2 back-end libraries have been built.

Final - Code generated versions and execution

The following parallel versions are produced by the code generator. These, and the previously developed developer versions, can be executed as follows (see the /final directory in the OP2-APPS repository for all the generated code):

  1. Developer sequential and developer mpi - no code-generation required.

#developer sequential
./airfoil_seq

#developer distributed memory with mpi, on 4 mpi procs
$MPI_INSTALL_PATH/bin/mpirun -np 4 ./airfoil_mpi_seq

  2. Code-gen sequential and MPI + Code-gen sequential

# code-gen sequential
./airfoil_genseq
#On 4 mpi procs
$MPI_INSTALL_PATH/bin/mpirun -np 4 ./airfoil_mpi_genseq

  3. Code-gen OpenMP, on 4 OpenMP threads with a mini-partition size of 256, and MPI + Code-gen OpenMP, on 4 MPI x 8 OpenMP with a mini-partition size of 256

# on 4 OMP threads
export OMP_NUM_THREADS=4; ./airfoil_openmp OP_PART_SIZE=256
#On 4 mpi procs with each proc running 8 OpenMP threads
export OMP_NUM_THREADS=8; $MPI_INSTALL_PATH/bin/mpirun -np 4 ./airfoil_mpi_openmp OP_PART_SIZE=256

  4. Code-gen SIMD vectorized and MPI + Code-gen SIMD vectorized, on 4 MPI procs

#SIMD vec
./airfoil_vec
#On 4 mpi procs with each proc running SIMD vec
$MPI_INSTALL_PATH/bin/mpirun -np 4 ./airfoil_mpi_vec

  5. Code-gen CUDA with mini-partition size of 128 and CUDA thread-block size of 192, and MPI + Code-gen CUDA

#On a single GPU
./airfoil_cuda OP_PART_SIZE=128 OP_BLOCK_SIZE=192
#On 4 mpi procs, each proc having a GPU
$MPI_INSTALL_PATH/bin/mpirun -np 4 ./airfoil_mpi_cuda OP_PART_SIZE=128 OP_BLOCK_SIZE=192

  6. MPI + Code-gen hybrid CUDA with mini-partition size of 128 and CUDA thread-block size of 192

The hybrid version can be run on both CPUs and GPUs at the same time. If only 4 GPUs are available, the following execution will allocate 4 MPI procs to run on the 4 GPUs and the remaining 8 MPI procs to the CPU cores.

#On 12 mpi procs
$MPI_INSTALL_PATH/bin/mpirun -np 12 ./airfoil_mpi_cuda_hyb OP_PART_SIZE=128 OP_BLOCK_SIZE=192

  7. Code-gen OpenACC with mini-partition size of 128 and thread-block size of 192, and MPI + Code-gen OpenACC

#On a single GPU
./airfoil_openacc OP_PART_SIZE=128 OP_BLOCK_SIZE=192
#On 4 mpi procs, each proc having a GPU
$MPI_INSTALL_PATH/bin/mpirun -np 4 ./airfoil_mpi_openacc OP_PART_SIZE=128 OP_BLOCK_SIZE=192

Optimizations

See the Performance Tuning section for a number of specific compile-time and runtime flags to obtain better performance.