
Recherche de similarité dans du code source (Similarity search in source code)

Abstract : Several phenomena cause source code duplication, such as copying and adapting code between projects or cloning within a single project. Searching for code matches makes it possible to factorize them within a project or to reveal cases of plagiarism. We study static similarity retrieval methods for source code that may have been transformed by edit operations such as insertion, deletion, transposition, and inlining or outlining of functions. Sequence similarity retrieval methods inspired by genomics are studied and adapted to find common chunks of tokenized source code. After explaining alignment and n-gram lookup techniques, we present a factorization method that merges the function call graphs of projects into a single graph, creating synthetic functions that model nested matches. It relies on suffix indexing structures to find repeated token factors. Syntax tree indexing is explored to handle huge code bases: similar sub-trees are looked up through hash values computed with heterogeneous abstraction profiles. Exact copies of sub-trees that are close to each other in their host trees may be merged to obtain approximate and extended matches. Before and after match retrieval, we define similarity metrics to preselect interesting code spots, refine the search process, or improve human understanding of the results.
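As a rough illustration of the n-gram lookup idea mentioned in the abstract, the following Python sketch indexes the token n-grams of one sequence and uses them as seeds to find common chunks in another. It is not the method of the thesis: the function names `ngram_index` and `common_chunks`, the whitespace tokenization, and the seed-and-extend strategy are all assumptions made for this example.

```python
from collections import defaultdict


def ngram_index(tokens, n):
    """Map each n-gram (tuple of n consecutive tokens) to its start positions."""
    index = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        index[tuple(tokens[i:i + n])].append(i)
    return index


def common_chunks(a, b, n):
    """Return (start_a, start_b, length) for common chunks of at least n tokens.

    Hypothetical helper: each shared n-gram is a seed that is extended
    to the right for as long as the two token sequences agree.
    """
    index = ngram_index(a, n)
    matches = []
    for j in range(len(b) - n + 1):
        for i in index.get(tuple(b[j:j + n]), ()):
            # Skip seeds that only continue a match starting one token earlier.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            length = n
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            matches.append((i, j, length))
    return matches


# Toy token streams; a real tokenizer would come from a lexer (assumption).
a = "int x = a + b ; return x ;".split()
b = "float y = a + b ; return y ;".split()
print(common_chunks(a, b, 3))  # prints [(2, 2, 6)]
```

The sketch only finds exact token matches. Handling renamed identifiers would require abstracting tokens before indexing, for instance by replacing every identifier with a single placeholder token.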

Cited literature [115 references]
Contributor : ABES STAR
Submitted on : Thursday, April 21, 2011 - 8:49:07 AM
Last modification on : Saturday, January 15, 2022 - 3:58:12 AM
Long-term archiving on : Thursday, November 8, 2012 - 5:00:54 PM


Version validated by the jury (STAR)


  • HAL Id : tel-00587628, version 1


Michel Chilowicz. Recherche de similarité dans du code source. Autre [cs.OH]. Université Paris-Est, 2010. Français. ⟨NNT : 2010PEST1012⟩. ⟨tel-00587628⟩


