Hybrid OpenMP/MPI programs for solving the time-dependent Gross–Pitaevskii equation in a fully anisotropic trap

Detailed bibliography
Published in: Computer Physics Communications, Volume 200, pp. 411-417
Main authors: Satarić, Bogdan; Slavnić, Vladimir; Belić, Aleksandar; Balaž, Antun; Muruganandam, Paulsamy; Adhikari, Sadhan K.
Medium: Journal Article
Language: English
Published: Elsevier B.V., 01.03.2016
ISSN: 0010-4655, 1879-2944
Description
Summary: We present hybrid OpenMP/MPI (Open Multi-Processing/Message Passing Interface) parallelized versions of earlier published C programs (Vudragović et al. 2012) for calculating both stationary and non-stationary solutions of the time-dependent Gross–Pitaevskii (GP) equation in three spatial dimensions. The GP equation describes the properties of dilute Bose–Einstein condensates at ultra-cold temperatures. The hybrid versions of the programs use the same algorithms as the C ones, involving real- and imaginary-time propagation based on a split-step Crank–Nicolson method, but consider only the fully anisotropic three-dimensional GP equation, where the algorithmic complexity for large grid sizes necessitates parallelization in order to reduce execution time and/or memory requirements per node. Since a distributed-memory approach is required to address the latter, we combine the MPI programming paradigm with the existing OpenMP codes, thus creating fully flexible parallelism within a combined distributed/shared memory model, suitable for different modern computer architectures. The two presented C/OpenMP/MPI programs for real- and imaginary-time propagation are optimized and accompanied by a customizable makefile. We present typical scalability results for the provided OpenMP/MPI codes and demonstrate almost linear speedup until inter-process communication time starts to dominate over calculation time per iteration. Such a scalability study is necessary for large grid sizes in order to determine the optimal number of MPI nodes and OpenMP threads per node.
Program title: GP-SCL-HYB package, consisting of: (i) imagtime3d-hyb, (ii) realtime3d-hyb.
Catalogue identifier: AEDU_v3_0
Program Summary URL: http://cpc.cs.qub.ac.uk/summaries/AEDU_v3_0.html
Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland.
Licensing provisions: Apache License 2.0
No. of lines in distributed program, including test data, etc.: 26397.
No. of bytes in distributed program, including test data, etc.: 161195.
Distribution format: tar.gz.
Programming language: C/OpenMP/MPI.
Computer: Any modern computer with a C language, OpenMP- and MPI-capable compiler installed.
Operating system: Linux, Unix, Mac OS X, Windows.
RAM: Total memory required to run the programs with the supplied input files, distributed over the used MPI nodes: (i) 310 MB, (ii) 400 MB. Larger grid sizes require more memory, which scales with Nx*Ny*Nz.
Number of processors used: No limit; from one to all available CPU cores can be used on all MPI nodes.
Number of nodes used: No limit on the number of MPI nodes that can be used. Depending on the grid size of the physical problem and the communication overheads, the optimal number of MPI nodes and threads per node can be determined by a scalability study for a given hardware platform.
Classification: 2.9, 4.3, 4.12.
Catalogue identifier of previous version: AEDU_v2_0.
Journal reference of previous version: Comput. Phys. Commun. 183 (2012) 2021.
Does the new version supersede the previous version?: No.
Nature of problem: These programs are designed to solve the time-dependent Gross–Pitaevskii (GP) nonlinear partial differential equation in three spatial dimensions in a fully anisotropic trap using a hybrid OpenMP/MPI parallelization approach. The GP equation describes the properties of a dilute trapped Bose–Einstein condensate.
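For orientation, a commonly used dimensionless form of the fully anisotropic 3d GP equation (the notation below is assumed for illustration and is not quoted from the paper) reads

i \, \partial_t \phi(\mathbf{r},t) = \left[ -\tfrac{1}{2}\nabla^2 + \tfrac{1}{2}\left( \gamma^2 x^2 + \nu^2 y^2 + \lambda^2 z^2 \right) + g \, |\phi(\mathbf{r},t)|^2 \right] \phi(\mathbf{r},t),

where \gamma, \nu, \lambda are the trap anisotropy parameters and the nonlinearity g is proportional to the number of atoms and the s-wave scattering length. Imaginary-time propagation of this equation relaxes an initial guess towards the stationary ground state, while real-time propagation describes the dynamics of a given initial state.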
Solution method: The time-dependent GP equation is solved by the split-step Crank–Nicolson method using discretization in space and time. The discretized equation is then solved by propagation, in either imaginary or real time, over small time steps. The method yields solutions of stationary and/or non-stationary problems.
Reasons for the new version: The previous C [1] and Fortran [2] programs are widely used within the ultracold atoms and nonlinear optics communities, as well as in various other fields [3]. This new version represents an extension of the two previously OpenMP-parallelized programs for propagation in imaginary and real time in three spatial dimensions (imagtime3d-th and realtime3d-th) to hybrid, fully distributed OpenMP/MPI programs (imagtime3d-hyb and realtime3d-hyb). The hybrid extensions of the previous OpenMP codes enable interested researchers to numerically study Bose–Einstein condensates in much greater detail (i.e., with much finer resolution) than with the OpenMP codes. In the OpenMP (threaded) versions of the programs, the numbers of discretization points in the X, Y, and Z directions are bounded by the total amount of memory available on the single computing node where the code is executed. The new, hybrid versions of the programs are not limited in this way, as large numbers of grid points in each spatial direction can be evenly distributed among the nodes of a cluster, effectively distributing the required memory over many MPI nodes. This is the first reason for the development of hybrid versions of the 3d codes. The second reason is the speedup in the execution of numerical simulations that can be gained by using multiple computing nodes with the OpenMP/MPI codes.
Summary of revisions: The two C/OpenMP programs in three spatial dimensions from the previous version [1] of the codes (imagtime3d-th and realtime3d-th) are transformed and rewritten into hybrid OpenMP/MPI programs, named imagtime3d-hyb and realtime3d-hyb. The overall structure of the two programs is identical. The directory structure of the GP-SCL-HYB package is extended compared to the previous version and now contains a folder scripts, with examples of scripts that can be used to run the programs on a typical MPI cluster. The corresponding readme.txt file contains more details. We have also included a makefile with tested and verified settings for the most popular MPI compilers, including OpenMPI (Open Message Passing Interface) [4] and MPICH (Message Passing Interface Chameleon) [5]. The transformation from the pure OpenMP to a hybrid OpenMP/MPI approach required that the array containing the condensate wavefunction be distributed among the MPI nodes of a computer cluster. Several data distribution models have been considered for this purpose, including block distribution and block-cyclic distribution of data in a 2d matrix. Finally, we decided to distribute the wavefunction values across different nodes so that each node contains only one slice of the X-dimension data, together with the complete corresponding Y- and Z-dimension data, as illustrated in Fig. 1. This allows the central functions of our numerical algorithm, calcluy, calcluz, and calcnu, to be executed purely in parallel on different MPI nodes of a cluster, without any overhead or communication, as the nodes contain all the Y- and Z-dimension data in the given X-sub-domain.
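As an illustration of this slicing, the following minimal C/MPI sketch allocates one X-slab per MPI process and stores it in a contiguous 1d array, so that loops over Y and Z for a given local X index touch only local data. The fixed grid size, the variable names, and the assumption that Nx is divisible by the number of processes are choices made here for illustration, not taken from the package.

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    #define NX 240
    #define NY 200
    #define NZ 160

    int main(int argc, char **argv) {
        int rank, nproc;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        /* Each MPI process owns one slice of the X dimension and the full Y and Z ranges. */
        long localnx = NX / nproc;            /* assumes NX is divisible by nproc */
        long offsetx = (long) rank * localnx; /* first global X index owned by this rank */

        /* The local slice is a single contiguous 1d array:
           Z has stride 1, Y has stride NZ, X has stride NY*NZ. */
        double *psi = malloc((size_t) localnx * NY * NZ * sizeof *psi);
        if (psi == NULL) MPI_Abort(MPI_COMM_WORLD, 1);

        /* Operations along Y and Z are fully local; OpenMP threads parallelize them
           within the node, reflecting the hybrid distributed/shared memory model. */
        #pragma omp parallel for
        for (long ix = 0; ix < localnx; ix++)
            for (long iy = 0; iy < NY; iy++)
                for (long iz = 0; iz < NZ; iz++)
                    psi[(ix * NY + iy) * NZ + iz] = 0.;   /* initialize the slice */

        printf("rank %d owns global X indices [%ld, %ld)\n",
               rank, offsetx, offsetx + localnx);

        free(psi);
        MPI_Finalize();
        return 0;
    }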
However, a problem arises when the functions calclux, calcrms, and calcmuen need to be executed, as they also operate on the whole X-dimension data. Thus, additional communication is needed during the execution of the function calcrms, while the functions calclux and calcmuen additionally require a transposition of data between the X and Y dimensions, with the data in the Z dimension remaining contiguous. The transposition provides the nodes with all the X-dimension data necessary to execute the functions calclux and calcmuen. However, it has to be done in each iteration of the numerical algorithm, thus necessarily increasing the communication overhead of the simulation. The transposition algorithms considered were the ones that account for the greatest common divisor (GCD) of the numbers of nodes in the columns (designated by N) and rows (designated by M) of a cluster configured as a 2d mesh of nodes [6]. Two such algorithms have been tested and tried for implementation: the case GCD=1 and the case GCD>1. The trivial situation N=M=1 is already covered by the previous, purely OpenMP programs, and therefore, without any loss of generality, we have considered only configurations with the number of nodes in the X dimension satisfying N>1. Only the former algorithm (GCD=1) was found to be sound in the case where the data matrix is not a 2d, but a 3d structure. The latter case was found to be too demanding implementation-wise, since MPI functions and datatypes are bound by certain limitations. Therefore, the algorithm with M=1 nodes in the Y dimension was implemented, as depicted by the wavefunction data structure in Fig. 1. The implementation of the algorithm relies on a sliced distribution of data among the nodes, as explained in Fig. 2. This successfully solves the problem of the large RAM consumption of the 3d codes, which arises even for moderate grid sizes. However, it does not solve the problem of data transposition between the nodes. In order to implement the most effective (GCD=1) transposition algorithm according to Ref. [6], we had to carry out a block distribution of data within the data slice contained on a single node. This block distribution was done implicitly, i.e., the data on one node are stored in a single 1d array (psi) of contiguous memory, in which the Z dimension has stride 1, the Y dimension has stride Nz, and the X dimension has stride Ny*Nz. This differs from the previous implementation of the programs, where the wavefunction was represented by an explicit 3d array. The change was also introduced in order to more easily form user-defined MPI datatypes, which allow for an implicit block distribution of data and represent 3d blocks of data within the 1d data array. These blocks are then swapped between nodes, effectively performing the transposition in the X–Y and Y–X directions. Together with the transposition of blocks between the nodes, the data within each block also have to be redistributed. To illustrate how this works, let us consider the example shown in Fig. 1(a), where one data block has size (Nx/gsize)*(Ny/gsize)*Nz. It represents one 3d data block, swapped between two nodes of a cluster (through one non-blocking MPI_Isend and one MPI_Irecv operation), containing (Nx/gsize)*(Ny/gsize) 1d rods of contiguous Nz data. These rods themselves need to be transposed within each block as well, in order to complete the X–Y transposition.
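A minimal sketch of such a block exchange is given below, assuming the same contiguous psi layout as above, gsize MPI processes, and grid dimensions divisible by gsize. The function and variable names are illustrative and do not reproduce the package's actual implementation, which follows the GCD=1 transposition algorithm of Ref. [6]; the sketch only shows the MPI machinery the text describes, namely a user-defined subarray datatype selecting one 3d block inside the local slice and a non-blocking MPI_Isend/MPI_Irecv pair per partner rank.

    #include <stdlib.h>
    #include <complex.h>
    #include <mpi.h>

    /* Hypothetical sketch: exchange one (localnx x ny/gsize x nz) block of psi with
       every rank. Each block is described by a derived subarray datatype inside the
       contiguous local slice (Z stride 1, Y stride nz, X stride ny*nz) and received
       into a contiguous work buffer. */
    void exchange_blocks(double complex *psi, double complex *work,
                         long localnx, long ny, long nz,
                         int gsize, MPI_Comm comm)
    {
        long localny = ny / gsize;                 /* assumes ny is divisible by gsize */
        MPI_Request *req = malloc(2 * gsize * sizeof *req);

        for (int p = 0; p < gsize; p++) {
            /* Block destined for rank p: Y indices [p*localny, (p+1)*localny). */
            int sizes[3]    = { (int) localnx, (int) ny,      (int) nz };
            int subsizes[3] = { (int) localnx, (int) localny, (int) nz };
            int starts[3]   = { 0, (int) (p * localny), 0 };
            MPI_Datatype block;
            MPI_Type_create_subarray(3, sizes, subsizes, starts,
                                     MPI_ORDER_C, MPI_C_DOUBLE_COMPLEX, &block);
            MPI_Type_commit(&block);

            /* Non-blocking exchange of 3d blocks between pairs of ranks. */
            MPI_Isend(psi, 1, block, p, 0, comm, &req[2 * p]);
            MPI_Irecv(work + (long) p * localnx * localny * nz,
                      (int) (localnx * localny * nz), MPI_C_DOUBLE_COMPLEX,
                      p, 0, comm, &req[2 * p + 1]);
            MPI_Type_free(&block);   /* pending operations still complete correctly */
        }
        MPI_Waitall(2 * gsize, req, MPI_STATUSES_IGNORE);
        free(req);

        /* The Nz-long rods inside each received block still have to be reordered
           locally to complete the X-Y transposition. */
    }

Exchanging whole 3d blocks of Nz-long rods keeps the Z data contiguous in each message, consistent with the requirement stated above that the Z dimension remain contiguous.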
DOI: 10.1016/j.cpc.2015.12.006