// End of the calcOpticalFlowSF listing (excerpt from the OpenCV simpleflow implementation).
Mat new_speed_up, new_speed_up_inv;

selectPointsToRecalcFlow(flow, averaging_radius, (float) speed_up_thr,
                         curr_rows, curr_cols, speed_up, new_speed_up, mask);
selectPointsToRecalcFlow(flow_inv, averaging_radius, (float) speed_up_thr,
                         curr_rows, curr_cols, speed_up_inv, new_speed_up_inv, mask_inv);

flow = upscaleOpticalFlow(curr_rows, curr_cols, prev_from, confidence, flow,
                          upscale_averaging_radius, (float) upscale_sigma_dist,
                          (float) upscale_sigma_color);
flow_inv = upscaleOpticalFlow(curr_rows, curr_cols, prev_to, confidence_inv, flow_inv,
                              upscale_averaging_radius, (float) upscale_sigma_dist,
                              (float) upscale_sigma_color);

calcConfidence(curr_from, curr_to, flow, confidence, max_flow);
calcOpticalFlowSingleScaleSF(curr_from_extended, curr_to_extended, mask,
                             flow, averaging_radius, max_flow,
                             (float) sigma_dist, (float) sigma_color);

calcConfidence(curr_to, curr_from, flow_inv, confidence_inv, max_flow);
calcOpticalFlowSingleScaleSF(curr_to_extended, curr_from_extended, mask_inv,
                             flow_inv, averaging_radius, max_flow,
                             (float) sigma_dist, (float) sigma_color);

extrapolateFlow(flow, speed_up);
extrapolateFlow(flow_inv, speed_up_inv);

// TODO: should we remove occlusions for the last stage?
removeOcclusions(flow, flow_inv, (float) occ_thr, confidence);
removeOcclusions(flow_inv, flow, (float) occ_thr, confidence_inv);

crossBilateralFilter(flow, curr_from, confidence, flow, postprocess_window,
                     (float) sigma_color_fix, (float) sigma_dist_fix);

GaussianBlur(flow, flow, Size(3, 3), 5);

_resulted_flow.create(flow.size(), CV_32FC2);
Mat resulted_flow = _resulted_flow.getMat();
int from_to[] = {0, 1, 1, 0};
mixChannels(&flow, 1, &resulted_flow, 1, from_to, 2);

CV_EXPORTS_W void calcOpticalFlowSF(InputArray from, InputArray to,
                                    OutputArray flow, int layers,
                                    int averaging_block_size, int max_flow) {
  orig_calcOpticalFlowSF(from, to, flow, layers, averaging_block_size, max_flow,
                         /* remaining default parameters elided in this excerpt */);
}
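The postprocessing step of the listing weights each neighboring flow sample by its spatial distance, its color distance in the guide image, and its per-pixel confidence. Purely as a sketch of that weighting idea — the 1-D restriction, the name crossBilateralFilter1D, and its parameters are illustrative assumptions, not OpenCV's API:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// 1-D sketch of confidence-weighted cross-bilateral filtering.
// Each output sample is a normalized sum of its neighbors, weighted by a
// spatial Gaussian, a range (color) Gaussian computed on the guide signal,
// and the neighbor's confidence.
std::vector<double> crossBilateralFilter1D(const std::vector<double>& signal,
                                           const std::vector<double>& guide,
                                           const std::vector<double>& confidence,
                                           int radius, double sigma_dist,
                                           double sigma_color) {
  const int n = static_cast<int>(signal.size());
  std::vector<double> out(n);
  for (int i = 0; i < n; ++i) {
    double num = 0.0, den = 0.0;
    for (int j = std::max(0, i - radius); j <= std::min(n - 1, i + radius); ++j) {
      const double dd = static_cast<double>(i - j);
      const double dc = guide[i] - guide[j];
      const double w = confidence[j]
                     * std::exp(-dd * dd / (2.0 * sigma_dist * sigma_dist))
                     * std::exp(-dc * dc / (2.0 * sigma_color * sigma_color));
      num += w * signal[j];
      den += w;
    }
    // Keep the original sample if every neighbor weight vanished.
    out[i] = (den > 0.0) ? num / den : signal[i];
  }
  return out;
}
```

Because a neighbor with zero confidence contributes nothing, an outlier flagged by the occlusion test is simply interpolated from its trusted neighbors, which is the role the confidence map plays in the 2-D filter above.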

Evolution of the number of publications indexed by Google Scholar for the keywords GPU and GPGPU

Example of popular-science material comparing CPU and GPU architectures. Source: Nvidia

Distribution of the on-board sensors of the Model S car

Evolution of the peak performance of different architectures over time. The upper graph shows compute performance, the lower one memory bandwidth, p.9

Overview of the Nvidia Pascal GP104 architecture used in the GTX 1080

View of a streaming multiprocessor (SM) of the Nvidia Pascal architecture, p.13

Macroscopic view of the methodology for mapping algorithms onto hybrid CPU and GPU architectures

Details of the static code analysis phase

Spinal representation of the removeOcclusions function, p.52

Spinal representation of the removeOcclusions function, p.55

Details of the dynamic code analysis phase, p.57

Loop-nest transformations for SIMT architectures, p.58

Example of a loop-nest pattern for GPU, p.58

Excerpt of the spinal representation of the crossBilateralFilter function (1/2), p.66

Excerpt of the spinal representation of the crossBilateralFilter function (2/2), p.67

Source code generation for a host and a GPU-type accelerator, p.82

Spinal representation of the calcIrregularityMat function, in which blocks b1 and b2 prevent perfectly nested loops, p.84

Moving nested blocks in the calcIrregularityMat function (exclusion method)

Moving nested blocks in the calcIrregularityMat function (inclusion method), p.87

Moving inter-loop blocks in the calcIrregularityMat function (inclusion-and-synchronization method)

View of an SMX cluster of the first-generation Nvidia Kepler architecture used in the Quadro K2000

View of an SMM cluster of the second-generation Nvidia Maxwell architecture used in the Tegra X1

Execution of the original simpleFlow algorithm

Execution of the original simpleFlow algorithm

Initial mapping of the simpleFlow algorithm onto the GPU of the Jetson TX1, p.106

Initial mapping of the simpleFlow algorithm onto the GPU of Endicott, p.107

Excerpt of the spinal representation of the crossBilateralFilter function (1/2), p.108

Excerpt of the spinal representation of the crossBilateralFilter function (2/2), p.109

Excerpt of the spinal representation of the calcOpticalFlowSingleScaleSF function (1/2)

Excerpt of the spinal representation of the calcOpticalFlowSingleScaleSF function (2/2)

Improvement of the amount of mapping on the GPU of the Jetson

Improvement of the amount of mapping on the GPU of Endicott, p.115

Execution time of the local-variance algorithm as a function of neighborhood size

Average read access time for a cyclic distribution of memory accesses on the Nvidia Quadro K2000. Access function: R1, p.132

Average read access time for a block distribution of memory accesses on the Nvidia Quadro K2000. Access function: R2, p.133

Average read access time for a cyclic distribution of memory accesses on the Nvidia Quadro K2000. Access function: R1, p.134

Average read access time for a block distribution of memory accesses on the Nvidia Quadro K2000. Access function: R2, p.135

Average read access time for a cyclic distribution of memory accesses on the Nvidia TX1. Access function: R1. Reference: Block, p.138

Average read access time for a block distribution of memory accesses on the Nvidia TX1. Access function: R2. Reference: Block, p.139

Average read access time for a cyclic distribution of memory accesses on the Nvidia TX1. Access function: R1. Reference: Warp, p.140

Average read access time for a block distribution of memory accesses on the Nvidia Quadro K2000. Access function: R2, p.141

Analysis of intra-GPU kernel concurrency on Nvidia architectures

Spinal representation of the simpleFlow program (1/18), p.174

Spinal representation of the simpleFlow program (2/18), p.175

Spinal representation of the simpleFlow program (3/18), p.176

Spinal representation of the simpleFlow program (4/18), p.177

Spinal representation of the simpleFlow program (5/18), p.178

Spinal representation of the simpleFlow program (6/18), p.179

Spinal representation of the simpleFlow program (7/18), p.180

Spinal representation of the simpleFlow program (8/18), p.181

Spinal representation of the simpleFlow program (9/18), p.182

Spinal representation of the simpleFlow program (10/18), p.183

Spinal representation of the simpleFlow program (11/18), p.184

Spinal representation of the simpleFlow program (12/18), p.185

Spinal representation of the simpleFlow program (13/18), p.186

Spinal representation of the simpleFlow program (14/18), p.187

Spinal representation of the simpleFlow program (15/18), p.188

Spinal representation of the simpleFlow program (16/18), p.189

Spinal representation of the simpleFlow program (17/18), p.190

Spinal representation of the simpleFlow program (18/18), p.191

Summary table of GPU mapping solutions, p.42

Summary table of the experimental architectures used, p.99

Results of the thread-concurrency experiment, p.150

Execution time of the original simpleFlow algorithm on the Tegra X1, p.202

Execution time of the simpleFlow algorithm after its initial mapping onto the GPU of the Tegra X1

Execution time of the simpleFlow algorithm after improving the amount of mapping on the GPU of the Tegra X1

Execution time of the original simpleFlow algorithm on Endicott, p.211

Execution time of the simpleFlow algorithm after its initial mapping onto the GPU of Endicott

Execution time of the simpleFlow algorithm after improving the amount of mapping on the GPU of Endicott

List of source codes

Code excerpt from the simpleFlow algorithm, p.46

Kernel model used to evaluate coarse-grain parallelism on GPU

HMPP Hybrid Multicore Parallel Programming, pp. 25, 26

HPC High Performance Computing, pp. 10-12, 24, 240

IGP Integrated Graphics Processor, p. 10

ILP Instruction Level Parallelism, pp. 78, 152

Intel GMA Intel Graphics Media Accelerator, p. 10

Intel IPL Intel Image Processing Library, p. 19

Intel IPP Intel Integrated Performance Primitives, pp. 19, 20

IR Internal Representation, p. 41

ISA Instruction Set Architecture, pp. 4, 127

ISL Integer Set Library, p. 33

JIT Just In Time, pp. 16, 37

LIDAR LIght Detection And Ranging, p. 1

MIMD Multiple Instructions on Multiple Data, pp. 14, 157

MMX MultiMedia eXtension

MPI Message Passing Interface, p. 31

MPPA Multi-Purpose Processor Array, p. 155

MSI Modified Shared Invalid, p. 35

NASA National Aeronautics and Space Administration, p. 118

NPP Nvidia Performance Primitives, pp. 19, 20

NUMA Non Uniform Memory Access, p. 145

NVCC Nvidia CUDA Compiler, pp. 16, 128

NVPTX Nvidia Parallel Thread eXecution, p. 16

OpenACC Open ACCelerators, p. 30

OpenCL Open Computing Language, pp. 14, 20, 30-38

OpenCLIPP OpenCL Image Processing Primitives, p. 20

OpenCV Open Computer Vision, pp. 19, 157

OpenGL Open Graphics Library, pp. 15-20

OpenGL ES OpenGL for Embedded Systems, p. 15

OpenGL SC OpenGL for Safety Critical applications, p. 15

OpenMP Open Multi-Processing, pp. 25, 42

OS Operating System, pp. 16, 155

PCIe Peripheral Component Interconnect express, pp. 99, 155

PET Polyhedral Extraction Tool, p. 33

PGCD Plus Grand Commun Diviseur (greatest common divisor), p. 54

PIPS Programming Integrated Parallel System, pp. 31, 32

PPCG Polyhedral Parallel Code Generator, pp. 32, 124
