Recherche de similarité dans du code source [Similarity search in source code]

Abstract : Several phenomena cause source code duplication, such as copying and adaptation between projects or cloning within a single project. Finding code matches makes it possible to factorize them within a project or to highlight cases of plagiarism. We study static similarity retrieval methods for source code that may have been transformed by edit operations such as insertion, deletion, transposition, and inlining or outlining of functions. Sequence similarity retrieval methods inspired by genomics are studied and adapted to find common chunks of tokenized source code. After explaining alignment and n-gram lookup techniques, we present a factorization method that merges the function call graphs of several projects into a single graph, creating synthetic functions that model nested matches. This method relies on suffix indexing structures to find repeated token factors. Syntax tree indexing is explored to handle huge code bases: similar subtrees are looked up through hash values computed with heterogeneous abstraction profiles. Exact copies of subtrees that are close to one another in their host trees may be merged to obtain approximate and extended matches. Before and after match retrieval, we define similarity metrics to preselect interesting code spots, refine the search process, or improve human understanding of the results.
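
As an illustration of the n-gram lookup idea mentioned in the abstract, the following is a minimal sketch and not the implementation from the thesis. It assumes a deliberately naive tokenizer and a single abstraction rule (every identifier-like token, keywords included, becomes the placeholder `ID`). The function names `tokenize`, `abstract`, `ngram_index` and `common_ngrams` are hypothetical and chosen only for this example.

```python
import re
from collections import defaultdict


def tokenize(source):
    """Split source text into identifier-like words, integers and single symbols.

    Assumption: this naive lexer stands in for a real language-aware tokenizer.
    """
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def abstract(tokens):
    """Replace identifier-like tokens by 'ID' so that renaming does not hide a match.

    Note: keywords are abstracted too in this sketch, which a real tool would avoid.
    """
    return ["ID" if re.match(r"[A-Za-z_]", t) else t for t in tokens]


def ngram_index(tokens, n):
    """Map each n-gram of tokens to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        index[tuple(tokens[i:i + n])].append(i)
    return index


def common_ngrams(a, b, n=4):
    """Return (position in a, position in b) pairs whose n-grams are equal."""
    index = ngram_index(a, n)
    return [(i, j)
            for j in range(len(b) - n + 1)
            for i in index.get(tuple(b[j:j + n]), [])]


if __name__ == "__main__":
    first = abstract(tokenize("x = a + b;"))
    second = abstract(tokenize("total = u + v;"))
    print(common_ngrams(first, second, n=4))
```

Consecutive matching pairs along the same diagonal (the same offset between the two positions) could then be chained into longer matches. Larger values of `n` reduce spurious hits at the cost of missing short clones.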
Document type : Thesis

Cited literature : 115 references

https://pastel.archives-ouvertes.fr/tel-00587628
Contributor : Abes Star
Submitted on : Thursday, April 21, 2011 - 8:49:07 AM
Last modification on : Wednesday, July 4, 2018 - 4:37:56 PM
Long-term archiving on : Thursday, November 8, 2012 - 5:00:54 PM

File

TH2010PEST1012_complete.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-00587628, version 1

Citation

Michel Chilowicz. Recherche de similarité dans du code source. Autre [cs.OH]. Université Paris-Est, 2010. Français. ⟨NNT : 2010PEST1012⟩. ⟨tel-00587628⟩
