Optimization of Monte Carlo Neutron Transport Simulations with Emerging Architectures

Yunsong Wang

Résumé

Monte Carlo (MC) neutron transport simulations are widely used in the nuclear community to perform reference calculations with minimal approximations. The conventional MC method has a slow convergence according to the law of large numbers, which makes simulations computationally expensive. Cross section computation has been identified as the major performance bottleneck for MC neutron code. Typically, cross section data are precalculated and stored into memory before simulations for each nuclide, thus during the simulation, only table lookups are required to retrieve data from memory and the compute cost is trivial. We implemented and optimized a large collection of lookup algorithms in order to accelerate this data retrieving process. Results show that significant speedup can be achieved over the conventional binary search on both CPU and MIC in unit tests other than real case simulations. Using vectorization instructions has been proved effective on many-core architecture due to its 512-bit vector units; on CPU this improvement is limited by a smaller register size. Further optimization like memory reduction turns out to be very important since it largely improves computing performance. As can be imagined, all proposals of energy lookup are totally memory-bound where computing units does little things but only waiting for data. In another word, computing capability of modern architectures are largely wasted. Another major issue of energy lookup is that the memory requirement is huge: cross section data in one temperature for up to 400 nuclides involved in a real case simulation requires nearly 1 GB memory space, which makes simulations with several thousand temperatures infeasible to carry out with current computer systems.In order to solve the problem relevant to energy lookup, we begin to investigate another on-the-fly cross section proposal called reconstruction. The basic idea behind the reconstruction, is to do the Doppler broadening (performing a convolution integral) computation of cross sections on-the-fly, each time a cross section is needed, with a formulation close to standard neutron cross section libraries, and based on the same amount of data. The reconstruction converts the problem from memory-bound to compute-bound: only several variables for each resonance are required instead of the conventional pointwise table covering the entire resolved resonance region. Though memory space is largely reduced, this method is really time-consuming. After a series of optimizations, results show that the reconstruction kernel benefits well from vectorization and can achieve 1806 GFLOPS (single precision) on a Knights Landing 7250, which represents 67% of its effective peak performance. Even if optimization efforts on reconstruction significantly improve the FLOP usage, this on-the-fly calculation is still slower than the conventional lookup method. Under this situation, we begin to port the code on GPGPU to exploit potential higher performance as well as higher FLOP usage. On the other hand, another evaluation has been planned to compare lookup and reconstruction in terms of power consumption: with the help of hardware and software energy measurement support, we expect to find a compromising solution between performance and energy consumption in order to face the "power wall" challenge along with hardware evolution.

L’accès aux données de base, que sont les sections efficaces, constitue le principal goulot d’étranglement aux performances dans la résolution des équations du transport neutronique par méthode Monte Carlo (MC). Ces sections efficaces caractérisent les probabilités de collisions des neutrons avec les nucléides qui composent le matériau traversé. Elles sont propres à chaque nucléide et dépendent de l’énergie du neutron incident et de la température du matériau. Les codes de référence en MC chargent ces données en mémoire à l’ensemble des températures intervenant dans le système et utilisent un algorithme de recherche binaire dans les tables stockant les sections. Sur les architectures many-coeurs (typiquement Intel MIC), ces méthodes sont dramatiquement inefficaces du fait des accès aléatoires à la mémoire qui ne permettent pas de profiter des différents niveaux de cache mémoire et du manque de vectorisation de ces algorithmes.Tout le travail de la thèse a consisté, dans une première partie, à trouver des alternatives à cet algorithme de base en proposant le meilleur compromis performances/occupation mémoire qui tire parti des spécificités du MIC (multithreading et vectorisation). Dans un deuxième temps, nous sommes partis sur une approche radicalement opposée, approche dans laquelle les données ne sont pas stockées en mémoire, mais calculées à la volée. Toute une série d’optimisations de l’algorithme, des structures de données, vectorisation, déroulement de boucles et influence de la précision de représentation des données, ont permis d’obtenir des gains considérables par rapport à l’implémentation initiale.En fin de compte, une comparaison a été effectué entre les deux approches (données en mémoire et données calculées à la volée) pour finalement proposer le meilleur compromis en termes de performance/occupation mémoire. Au-delà de l'application ciblée (le transport MC), le travail réalisé est également une étude qui peut se généraliser sur la façon de transformer un problème initialement limité par la latence mémoire (« memory latency bound ») en un problème qui sature le processeur (« CPU-bound ») et permet de tirer parti des architectures many-coeurs.

Optimization of Monte Carlo Neutron Transport Simulations with Emerging Architectures

Optimisation du code Monte Carlo neutronique à l’aide d’accélérateurs de calculs

Résumé

Mots clés

Domaines

Dates et versions

Identifiants

Citer

Exporter

Collections

Partager