In general, the pieces of the grid assigned to different teams differ in size. The idea behind improving the efficiency of block decomposition is shown in Figure 3. These advantages are achieved at the cost of some extra computations performed by teams. The results achieved for porting selected parts of EULAG to nontraditional architectures revealed considerable potential for running scientific applications, including anelastic numerical models, on novel hardware architectures. Since each MPDATA block provides computations for all the stages, some extra calculations are required for each block.

Scientific Programming

For each platform, we use all the available cores with vectorization enabled. Thus, blocks have to be extended by adequate halo areas.
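The cost of these halo areas can be estimated with a short calculation. The sketch below is an illustration only, not the paper's method: it assumes a one-dimensional chain of stages in which stage `s` reads the output of the previous stage at offsets up to a hypothetical radius `radii[s]`. The names `stage_extents` and `redundancy` are introduced here for illustration. In three dimensions the same reasoning would apply per dimension.

```python
def stage_extents(block_size, radii):
    """Number of points each stage must compute for one block.

    radii[s] is the (hypothetical) stencil radius of stage s: stage s reads
    the output of stage s-1 at offsets -radii[s] .. +radii[s].
    """
    extents = []
    halo = 0  # the last stage computes exactly the block itself
    for r in reversed(radii):
        extents.append(block_size + 2 * halo)
        halo += r  # earlier stages must also cover what this stage reads
    extents.reverse()
    return extents


def redundancy(block_size, radii):
    """Ratio of computed points to useful points for one block."""
    computed = sum(stage_extents(block_size, radii))
    useful = block_size * len(radii)
    return computed / useful
```

Under these assumptions, the relative overhead shrinks as the block grows, which is one reason the block size has to be balanced against the cache capacity.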

The Intel Xeon Phi coprocessor includes processing cores, caches, memory controllers, PCIe client logic, and a very high bandwidth, bidirectional ring interconnect [ 28, 29 ].

In the basic, unoptimized implementation of the MPDATA algorithm (Algorithm 1), every stage reads the required set of matrices from the main memory and writes its results back to the main memory after computation.

The idea of block decomposition using a mixture of the loop tiling and loop fusion techniques is shown in Algorithm 2. The first-order-accurate advection equation is approximated to second order in, and through defining the advection-diffusion equation.
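To make the combination of loop tiling and loop fusion concrete, the following sketch applies it to a toy two-stage, one-dimensional stencil. It is not the MPDATA code from Algorithm 2: the stages `stage1` and `stage2`, the function names, and the block size are all hypothetical. The stage-by-stage version keeps a full intermediate array between the stages, whereas the blocked version computes the first stage only on the current block plus a one-point halo and consumes it immediately.

```python
def stage1(u, i):
    # Hypothetical first stage: 3-point average of the input.
    return (u[i - 1] + u[i] + u[i + 1]) / 3.0


def stage2(f, i):
    # Hypothetical second stage: centred difference of the stage-1 output.
    return f[i + 1] - f[i - 1]


def staged(u):
    """Stage by stage: a full intermediate array sits between the stages."""
    n = len(u)
    f = [0.0] * n
    for i in range(1, n - 1):
        f[i] = stage1(u, i)
    out = [0.0] * n
    for i in range(2, n - 2):
        out[i] = stage2(f, i)
    return out


def blocked(u, block):
    """Tiled and fused: both stages run per block on a small buffer."""
    n = len(u)
    out = [0.0] * n
    for lo in range(2, n - 2, block):
        hi = min(lo + block, n - 2)
        # Stage 1 on the block extended by a one-point halo on each side.
        # The halo points are recomputed by the neighbouring blocks too.
        tmp = [stage1(u, i) for i in range(lo - 1, hi + 1)]
        for i in range(lo, hi):
            k = i - (lo - 1)  # tmp[k] holds the stage-1 value at point i
            out[i] = stage2(tmp, k)
    return out
```

Both versions produce the same result. Python lists do not expose cache behaviour, so the sketch shows only the loop structure: the temporary buffer is sized by the block rather than by the grid, and the halo points are the extra computations paid for it.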


The primary goal here is to reduce the saturation of main memory traffic. The achieved performance results provide the basis for further research on optimizing the distribution of the MPDATA calculation across all the computing resources of the Intel MIC architecture, taking into consideration features of its on-board memory, cache hierarchy, computing cores, and vector units.

Another contribution of the paper is a method for increasing efficiency of computations by reducing intercache communications.

It allows us to ease the memory and communication bounds and better exploit the floating point efficiency of target computing platforms. All these features have a significant impact on the sustained performance.

In general, all these loops are now merged (Algorithm 2(c)). Another advantage of this approach is the possibility of reducing main memory consumption, because all the intermediate results are stored in the cache memory. Preliminary studies of porting anelastic numerical models to novel architectures, including hybrid CPU-GPU architectures, were carried out in works [ 15, 16 ].

Because of the intracache communications between tasks, the overall system performance depends on a suitable method of pinning tasks to the available cores. It should be noted that the efficiency of our adaptation scheme will increase once all the MPDATA stages are included in the code.

The main assumption for using the temporal blocking method is that no other computations need to be performed between consecutive stencils or stages. The memory behavior of stencil codes related to their performance on Xeon Phi was the primary focus of paper [ 27 ], where different types of regular stencils were studied.


Moreover, this technique is also commonly used by compilers to make the execution of certain types of loops more efficient [ 32 ].

Hence, every piece is partitioned into a number of MPDATA blocks, where subsequent blocks are processed one by one, and each block is processed in parallel by the work team. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The new ability to work with 512-bit vectors enables operating on 16 float or 8 double elements at once, instead of a single one.

The vectorization is performed along a single grid dimension, where the block size in that dimension is adjusted to the vector size.
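One plausible reading of "adjusted to the vector size" is that the block extent in the vectorized dimension is rounded up to a whole number of vector lanes. The sketch below works through that arithmetic. It assumes 512-bit vector registers, which is consistent with the 16 float or 8 double elements mentioned earlier, and the helper names `lanes` and `round_up_to_lanes` are hypothetical.

```python
def lanes(vector_bits, element_bits):
    """Number of elements that fit in one vector register."""
    return vector_bits // element_bits


def round_up_to_lanes(size, n_lanes):
    """Smallest multiple of n_lanes that is not less than size."""
    return ((size + n_lanes - 1) // n_lanes) * n_lanes


# Assumed register width of 512 bits with 32-bit float and 64-bit double.
FLOAT_LANES = lanes(512, 32)
DOUBLE_LANES = lanes(512, 64)
```

For example, a block extent of 100 points would be padded to 112 for single precision and to 104 for double precision, so that every vector operation works on full registers.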

Such an approach makes it possible to ease memory bounds by reusing the cache more efficiently and by reducing the memory traffic associated with intermediate computations. Both techniques are most often used to maximize the operational intensity ratio, reduce the loop overheads, increase the instruction parallelism, and improve the cache locality [ 33, 34 ]. Sizes of halo areas are determined in three dimensions: