P. Fortin, R. Habel, F. Jezequel, J. L. Lamotte, and N. S. Scott, Deployment on GPUs of an Application in Computational Atomic Physics, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, pp.1359-1366, 2011.
DOI : 10.1109/IPDPS.2011.285

URL : https://hal.archives-ouvertes.fr/hal-01285671

R. Habel, P. Fortin, F. Jezequel, J. L. Lamotte, and N. S. Scott, Numerical Validation and GPU Performance in Atomic Physics, Designing Scientific Applications on GPUs, 2013.

F. Silber-Chaussumier, A. Muller, and R. Habel, Generating data transfers for distributed GPU parallel programs, Journal of Parallel and Distributed Computing, vol.73, issue.12, pp.1649-1660, 2013.
DOI : 10.1016/j.jpdc.2013.07.022

URL : https://hal.archives-ouvertes.fr/hal-00925733

E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak et al., Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects, Journal of Physics: Conference Series, vol.180, issue.1, 2009.
DOI : 10.1088/1742-6596/180/1/012037

G. M. Amdahl, Validity of the single processor approach to achieving large scale computing capabilities, Proceedings of the April 18-20, 1967, spring joint computer conference on, AFIPS '67 (Spring), pp.483-485, 1967.
DOI : 10.1145/1465482.1465560

M. Amini, F. Coelho, F. Irigoin, and R. Keryell, Static Compilation Analysis for Host-Accelerator Communication Optimization, Languages and Compilers for Parallel Computing, pp.237-251, 2013.
DOI : 10.1007/978-3-642-36036-7_16

URL : https://hal.archives-ouvertes.fr/hal-00743496

M. Amini, B. Creusillet, S. Even, R. Keryell, O. Goubier et al., Par4all: From Convex Array Regions to Heterogeneous Computing, IMPACT 2012: Second International Workshop on Polyhedral Compilation Techniques, HiPEAC 2012, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00744733

C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu et al., TreadMarks: shared memory computing on networks of workstations, Computer, vol.29, issue.2, pp.18-28, 1996.
DOI : 10.1109/2.485843

C. Ancourt, F. Coelho, F. Irigoin, and R. Keryell, A Linear Algebra Framework for Static High Performance Fortran Code Distribution, Scientific Programming, pp.3-27, 1997.
DOI : 10.1155/1997/195689

C. Augonnet, J. Clet-Ortega, S. Thibault, and R. Namyst, Data-Aware Task Scheduling on Multi-accelerator Based Platforms, 2010 IEEE 16th International Conference on Parallel and Distributed Systems, pp.291-298, 2010.
DOI : 10.1109/ICPADS.2010.129

URL : https://hal.archives-ouvertes.fr/inria-00523937

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: a Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures, Concurrency and Computation: Practice and Experience, pp.187-198, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00384363

S. Baghdadi, A. Größlinger, and A. Cohen, Putting Automatic Polyhedral Compilation for GPGPU to Work, Proceedings of the 15th Workshop on Compilers for Parallel Computers (CPC'10), 2010.
URL : https://hal.archives-ouvertes.fr/inria-00551517

D. H. Bailey, E. Barszcz, J. T. Barton et al., The NAS parallel benchmarks, International Journal of High Performance Computing Applications, vol.5, issue.3, pp.63-73, 1991.

P. Banerjee, J. A. Chandy, M. Gupta, E. W. Hodges et al., The Paradigm compiler for distributed-memory multicomputers, Computer, vol.28, issue.10, pp.37-47, 1995.
DOI : 10.1109/2.467577

M. M. Baskaran, J. Ramanujam, and P. Sadayappan, Automatic C-to-CUDA Code Generation for Affine Programs, Compiler Construction, pp.244-263, 2010.

A. Basumallik and R. Eigenmann, Towards automatic translation of OpenMP to MPI, Proceedings of the 19th annual international conference on Supercomputing, ICS '05, pp.189-198, 2005.
DOI : 10.1145/1088149.1088174

A. Basumallik, S. Min, and R. Eigenmann, Programming Distributed Memory Systems Using OpenMP, 2007 IEEE International Parallel and Distributed Processing Symposium, pp.1-8, 2007.
DOI : 10.1109/IPDPS.2007.370397

M. Benabderrahmane, L. Pouchet, A. Cohen, and C. Bastoul, The Polyhedral Model Is More Widely Applicable Than You Think, Compiler Construction, pp.283-303, 2010.
DOI : 10.1007/978-3-642-11970-5_16

URL : https://hal.archives-ouvertes.fr/inria-00551087

R. Bolze, F. Cappello, E. Caron, M. Daydé, F. Desprez et al., Grid'5000: A Large Scale And Highly Reconfigurable Experimental Grid Testbed, International Journal of High Performance Computing Applications, vol.20, issue.4, pp.481-494, 2006.
DOI : 10.1177/1094342006070078

URL : https://hal.archives-ouvertes.fr/hal-00684943

D. Bonachea, GASNet Specification, V1.1, 2002.

M. Bourgoin, E. Chailloux, and J. Lamotte, SPOC: GPGPU Programming Through Stream Processing with OCaml, Parallel Processing Letters, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00697257

M. Bourgoin, E. Chailloux, and J. Lamotte, Efficient Abstractions for GPGPU Programming, International Journal of Parallel Programming, vol.34, issue.5, pp.583-600, 2014.
DOI : 10.1007/s10766-013-0261-x

URL : https://hal.archives-ouvertes.fr/hal-01216144

F. Broquedis, J. Clet-Ortega, S. Moreaud, N. Furmento, B. Goglin et al., hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications, 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing, pp.180-186, 2010.
DOI : 10.1109/PDP.2010.67

URL : https://hal.archives-ouvertes.fr/inria-00429889

J. Bueno, L. Martinell, A. Duran, M. Farreras, X. Martorell et al., Productive Cluster Programming with OmpSs, Euro-Par 2011 Parallel Processing, pp.555-566, 2011.
DOI : 10.1147/rd.515.0593

J. Bueno, X. Martorell, R. M. Badia, E. Ayguadé, and J. Labarta, Implementing OmpSs support for regions of data in architectures with multiple address spaces, Proceedings of the 27th international ACM conference on International conference on supercomputing, ICS '13, pp.359-368, 2013.
DOI : 10.1145/2464996.2465017

B. L. Chamberlain, D. Callahan, and H. P. Zima, Parallel Programmability and the Chapel Language, International Journal of High Performance Computing Applications, vol.21, issue.3, pp.291-312, 2007.

P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra et al., X10: an Object-oriented Approach to Non-uniform Cluster Computing, ACM SIGPLAN Notices, vol.40, issue.10, pp.519-538, 2005.

L. Chen, L. Liu, S. Tang, L. Huang, Z. Jing et al., Unified Parallel C for GPU Clusters: Language Extensions and Compiler Implementation, Languages and Compilers for Parallel Computing, pp.151-165, 2011.
DOI : 10.1007/978-3-642-03869-3_82

B. Creusillet and F. Irigoin, Interprocedural array region analyses, Languages and Compilers for Parallel Computing, pp.46-60, 1996.
DOI : 10.1007/BFb0014191

URL : https://hal.archives-ouvertes.fr/hal-00752611

L. Dagum and R. Menon, OpenMP: an industry standard API for shared-memory programming, IEEE Computational Science and Engineering, vol.5, issue.1, pp.46-55, 1998.
DOI : 10.1109/99.660313

F. Darema, The SPMD Model: Past, Present and Future, Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp.1-1, 2001.
DOI : 10.1007/3-540-45417-9_1

R. E. Diaconescu and H. P. Zima, An Approach To Data Distributions in Chapel, International Journal of High Performance Computing Applications, vol.21, issue.3, pp.313-335, 2007.
DOI : 10.1177/1094342007078451

R. Dolbeau, S. Bihan, and F. Bodin, HMPP: A Hybrid Multi-core Parallel Programming Environment, Workshop on General Purpose Processing on Graphics Processing Units, 2007.

J. Dongarra, T. Sterling, H. Simon, and E. Strohmaier, High-Performance Computing: Clusters, Constellations, MPPs, and Future Directions, Computing in Science and Engineering, vol.7, issue.2, pp.51-59, 2005.
DOI : 10.1109/MCSE.2005.34

A. Duran, E. Ayguadé, R. M. Badia, J. Labarta, L. Martinell et al., OmpSs: A Proposal For Programming Heterogeneous Multi-Core Architectures, Parallel Processing Letters, pp.173-193, 2011.

R. Duncan, A survey of parallel computer architectures, Computer, vol.23, issue.2, pp.5-16, 1990.
DOI : 10.1109/2.44900

J. Fang, A. L. Varbanescu, and H. Sips, A Comprehensive Performance Comparison of CUDA and OpenCL, 2011 International Conference on Parallel Processing, pp.216-225, 2011.
DOI : 10.1109/ICPP.2011.45

P. Feautrier, Dataflow analysis of array and scalar references, International Journal of Parallel Programming, vol.24, issue.4, 1991.
DOI : 10.1007/BF01407931

M. Flynn, Some Computer Organizations and Their Effectiveness, IEEE Transactions on Computers, vol.21, issue.9, pp.948-960, 1972.
DOI : 10.1109/TC.1972.5009071

M. Frumkin, H. Jin, and J. Yan, Implementation of NAS Parallel Benchmarks in High Performance Fortran, 1998.

E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra et al., Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation, Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp.97-104, 2004.
DOI : 10.1007/978-3-540-30218-6_19

J. Garcia, E. Ayguadé, and J. Labarta, A novel approach towards automatic data distribution, Proceedings of the 1995 ACM/IEEE conference on Supercomputing (CDROM), Supercomputing '95, pp.78-78, 1995.
DOI : 10.1145/224170.224500

W. Gropp, MPICH2: A New Start for MPI Implementations, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2002.
DOI : 10.1007/3-540-45825-5_5

S. Guelton, M. Amini, and B. Creusillet, Beyond Do Loops: Data Transfer Generation with Convex Array Regions, Languages and Compilers for Parallel Computing, pp.249-263, 2013.
DOI : 10.1007/978-3-642-37658-0_17

URL : https://hal.archives-ouvertes.fr/hal-00742583

M. Gupta and P. Banerjee, Demonstration of Automatic Data Partitioning Techniques for Parallelizing Compilers on Multicomputers, IEEE Transactions on Parallel and Distributed Systems, vol.3, issue.2, pp.179-193, 1992.

T. D. Han and T. S. Abdelrahman, hiCUDA: A High-level Directive-based Language for GPU Programming, Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, pp.52-61, 2009.

W. D. Hillis and G. L. Steele Jr., Data Parallel Algorithms, Communications of the ACM, vol.29, issue.12, pp.1170-1183, 1986.

J. P. Hoeflinger, Extending OpenMP to Clusters, White Paper, Intel Corporation, 2006.

S. Hong and H. Kim, An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness

F. Irigoin, P. Jouvelot, and R. Triolet, Semantical Interprocedural Parallelization: An Overview of the PIPS Project, Proceedings of the 5th international conference on Supercomputing, ICS '91, pp.244-251, 1991.
URL : https://hal.archives-ouvertes.fr/hal-00984684

D. Kim, Parameterized and Multi-level Tiled Loop Generation

K. Kusano, S. Satoh, and M. Sato, Performance Evaluation of the Omni OpenMP Compiler, High Performance Computing, pp.403-414, 2000.
DOI : 10.1007/3-540-39999-2_39

P. Lee and Z. M. Kedem, Automatic data and computation decomposition on distributed memory parallel computers, ACM Transactions on Programming Languages and Systems, vol.24, issue.1, pp.1-50, 2002.
DOI : 10.1145/509705.509706

S. Lee, T. A. Johnson, and R. Eigenmann, Cetus - An Extensible Compiler Infrastructure for Source-to-Source Transformation, Languages and Compilers for Parallel Computing, pp.539-553, 2004.
DOI : 10.1007/978-3-540-24644-2_35

S. Lee, S. Min, and R. Eigenmann, OpenMP to GPGPU, ACM SIGPLAN Notices, vol.44, issue.4, pp.101-110, 2009.
DOI : 10.1145/1594835.1504194

A. Leung, N. Vasilache, B. Meister, M. Baskaran, D. Wohlford et al., A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction, Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU '10, pp.51-61, 2010.
DOI : 10.1145/1735688.1735698

URL : https://hal.archives-ouvertes.fr/inria-00551084

J. Li and M. Chen, Index domain alignment: minimizing cost of cross-referencing between distributed arrays, [1990 Proceedings] The Third Symposium on the Frontiers of Massively Parallel Computation, pp.424-433, 1990.
DOI : 10.1109/FMPC.1990.89493

J. Mellor-Crummey, L. Adhianto, W. N. Scherer III, and G. Jin, A New Vision for Co-Array Fortran, Proceedings of the Third Conference on Partitioned Global Address Space Programing Models, PGAS '09, pp.1-5, 2009.

J. M. Mellor-Crummey, V. S. Adve, B. Broom et al., Advanced optimization strategies in the Rice dHPF compiler, Concurrency and Computation: Practice and Experience, pp.741-767, 2002.
DOI : 10.1002/cpe.647

J. Merlin, D. Miles, and V. Schuster, Distributed OMP: Extensions to OpenMP for SMP Clusters, Second European Workshop on OpenMP (EWOMP), pp.14-15, 2000.

D. Millot, A. Muller, C. Parrot, and F. Silber-Chaussumier, STEP: A Distributed OpenMP for Coarse-Grain Parallelism Tool, OpenMP in a New Era of Parallelism, pp.83-99, 2008.
DOI : 10.1007/978-3-540-79561-2_8

URL : https://hal.archives-ouvertes.fr/hal-01373120

D. Millot, A. Muller, C. Parrot, and F. Silber-Chaussumier, From OpenMP to MPI: First Experiments of the STEP Source-to-source Transformation Tool, The international Parallel Computing Conference (ParCo), pp.669-676, 2009.

G. E. Moore, Cramming More Components onto Integrated Circuits, 1965.

J. C. Mouriño, M. J. Martín, P. González, and R. Doallo, Dynamic Load-Balancing for the STEM-II Air Quality Model, Computational Science and Its Applications - ICCSA 2006, pp.701-710, 2006.

M. Nakao, J. Lee, T. Boku, and M. Sato, XcalableMP implementation and performance of NAS Parallel Benchmarks, Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, PGAS '10, pp.1-11, 2010.
DOI : 10.1145/2020373.2020384

M. Nakao, J. Lee, T. Boku, and M. Sato, Productivity and Performance of Global-View Programming with XcalableMP PGAS Language, 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012), pp.402-409, 2012.
DOI : 10.1109/CCGrid.2012.118

J. Nieplocha, R. J. Harrison, and R. J. Littlefield, Global arrays: A nonuniform memory access programming model for high-performance computers, The Journal of Supercomputing, vol.10, issue.2, pp.169-189, 1996.
DOI : 10.1007/BF00130708

R. W. Numrich and J. Reid, Co-Array Fortran for Parallel Programming, SIGPLAN Fortran Forum, vol.17, issue.2, pp.1-31, 1998.

J. D. Owens, D. Luebke, N. Govindaraju, M. Harris et al., A Survey of General-Purpose Computation on Graphics Hardware, Computer Graphics Forum, pp.80-113, 2007.

L. Pouchet, U. Bondhugula, C. Bastoul, A. Cohen, J. Ramanujam et al., Hybrid Iterative and Model-driven Optimization in the Polyhedral Model The Polyhedral Benchmark suite, 2014.

T. J. Richardson and R. L. Urbanke, The capacity of low-density parity-check codes under message-passing decoding, IEEE Transactions on Information Theory, vol.47, issue.2, pp.599-618, 2001.

T. Saif and M. Parashar, Understanding the Behavior and Performance of Non-blocking Communications in MPI, Euro-Par 2004 Parallel Processing, pp.173-182, 2004.
DOI : 10.1007/978-3-540-27866-5_22

J. Saltz, K. Crowley, R. Michandaney, and H. Berryman, Run-time scheduling and execution of loops on message passing machines, Journal of Parallel and Distributed Computing, vol.8, issue.4, pp.303-312, 1990.
DOI : 10.1016/0743-7315(90)90129-D

J. E. Stone, D. Gohara, and G. Shi, OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems, Computing in Science & Engineering, p.66, 2010.

H. Sutter, The Free Lunch is Over: A Fundamental Turn Toward Concurrency in Software, Dr. Dobb's Journal, vol.30, issue.3, pp.202-210, 2005.

F. Trahay, E. Brunet, A. Denis, and R. Namyst, A multithreaded communication engine for multicore architectures, 2008 IEEE International Symposium on Parallel and Distributed Processing, pp.1-7, 2008.
DOI : 10.1109/IPDPS.2008.4536139

URL : https://hal.archives-ouvertes.fr/inria-00224999

M. Ujaldon, E. L. Zapata, B. M. Chapman, and H. P. Zima, Vienna-Fortran/HPF Extensions for Sparse and Irregular Problems and their Compilation, IEEE Transactions on Parallel and Distributed Systems, vol.8, issue.10, pp.1068-1083, 1997.

L. G. Valiant, A Bridging Model for Parallel Computation, Communications of the ACM, vol.33, issue.8, pp.103-111, 1990.

R. F. Van der Wijngaart and P. Wong, NAS Parallel Benchmarks Version 2.4, NAS Technical Report NAS-02-007, 2002.

S. Verdoolaege, J. C. Juega, A. Cohen, J. I. Gómez, C. Tenllado et al., Polyhedral parallel code generation for CUDA, ACM Transactions on Architecture and Code Optimization, vol.9, issue.4, p.54, 2013.
DOI : 10.1145/2400682.2400713

URL : https://hal.archives-ouvertes.fr/hal-00786677

T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser, Active Messages: A Mechanism for Integrating Communication and Computation, 25 Years of the International Symposia on Computer Architecture, ISCA '98, pp.430-440, 1998.

S. Wienke, P. Springer, C. Terboven, and D. an Mey, OpenACC - First Experiences with Real-World Applications, Euro-Par 2012 Parallel Processing, pp.859-870, 2012.
DOI : 10.1007/978-3-642-32820-6_85

S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands et al., The potential of the cell processor for scientific computing, Proceedings of the 3rd conference on Computing frontiers, CF '06, pp.9-20, 2006.
DOI : 10.1145/1128022.1128027

K. Yelick, D. Bonachea, W. Chen, P. Colella, K. Datta et al., Productivity and performance using partitioned global address space languages, Proceedings of the 2007 international workshop on Parallel symbolic computation, PASCO '07, pp.24-32, 2007.
DOI : 10.1145/1278177.1278183

T. Yuki and S. Rajopadhye, Parametrically Tiled Distributed Memory Parallelization of Polyhedral Programs, 2013.