Hybrid Extraction and Structural Description of Characters for Efficient Text Recognition in Heterogeneous Scanned Documents: Methods and Parallel Algorithms

Abstract : Optical Character Recognition (OCR) is a process that converts images of text into editable text documents. Today, OCR systems are widely used in dematerialization applications such as mail sorting and bill management. In this context, the aim of this thesis is to propose an OCR system that offers a better compromise between recognition rate and processing speed, enabling reliable, real-time document dematerialization. To be recognized, the text is first extracted from the background. It is then segmented into disjoint characters, which are described by their structural characteristics. Finally, the characters are recognized by comparing their descriptors with predefined ones.

Text extraction based on binarization methods remains difficult in heterogeneous scanned documents with a complex and noisy background, where the text may be confused with a textured background or with noise. Character description and segment extraction, for their part, are often complex: they rely on the calculation of geometrical transformations or polygons and involve a large number of characteristics, or they discriminate poorly when the selected characteristics are sensitive to variations in scale, style, etc. We therefore adapt our algorithms to heterogeneous scanned documents. We also provide high discrimination between characters, with a description based on the study of character structure through horizontal and vertical projections. To ensure real-time processing, we parallelize the developed algorithms on the graphics processing unit (GPU). The main contributions of the proposed OCR system are as follows:

  • A new binarization method for heterogeneous scanned documents that include text regions with a complex or homogeneous background. In this method, an image analysis process is followed by a classification of the document areas into images (text with a complex background) and text (text with a homogeneous background). For text regions, text extraction is performed with a hybrid method based on the K-means classification algorithm (CHK) that we developed for this purpose. This method combines local and global approaches. It improves the quality of the text/background separation while minimizing distortion when extracting text from scanned documents made noisy by the digitization process. The image areas are enhanced with Gamma Correction (CG) before HBK is applied. In our experiments, our text extraction method yields a 98% character recognition rate on heterogeneous scanned documents.

  • A Unified Character Descriptor based on the study of character structure. It employs a sufficient number of characteristics, obtained by unifying the descriptors of the horizontal and vertical projections of the characters, for efficient discrimination. The advantage of this descriptor lies in both its high performance and its simple computation. It supports the recognition of alphanumeric and multiscale characters. The proposed descriptor achieves a 100% character recognition rate for a given typeface and font size.

  • Parallelization of the proposed character recognition system. The GPU has been used as the parallelization platform. Flexible and powerful, this architecture provides an effective solution for accelerating intensive image processing algorithms. Our implementation combines coarse-grained and fine-grained parallelization strategies to speed up the steps of the OCR chain. In addition, CPU-GPU communication overheads are avoided and good memory management is ensured. The effectiveness of our implementation is validated through extensive experiments.
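The building blocks named in the abstract (gamma correction, K-means-based binarization, and horizontal and vertical projections) can be illustrated with a minimal Python sketch. This is not the thesis's hybrid K-means method, nor its Unified Character Descriptor: it assumes a purely global two-cluster K-means on gray levels, dark text on a light background, and raw pixel-count projections. All function names and the toy image are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: NOT the thesis's hybrid K-means binarization
# or its Unified Character Descriptor. All names here are hypothetical.

def gamma_correct(image, gamma):
    """Apply gamma correction to an 8-bit grayscale image (list of rows)."""
    return [[round(255 * (p / 255) ** gamma) for p in row] for row in image]


def two_means_threshold(pixels, iterations=50):
    """Global 1-D K-means with k=2 on gray levels.

    Returns the midpoint between the two cluster centres, which is the
    decision boundary between the dark and light clusters.
    """
    lo, hi = float(min(pixels)), float(max(pixels))
    for _ in range(iterations):
        mid = (lo + hi) / 2
        dark = [p for p in pixels if p <= mid]
        light = [p for p in pixels if p > mid]
        if not dark or not light:
            break  # only one cluster is populated (e.g. a uniform image)
        new_lo, new_hi = sum(dark) / len(dark), sum(light) / len(light)
        if (new_lo, new_hi) == (lo, hi):
            break  # centres stopped moving: converged
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


def binarize(image):
    """Label pixels at or below the threshold as text (1), others as background (0).

    Assumes dark text on a light background.
    """
    t = two_means_threshold([p for row in image for p in row])
    return [[1 if p <= t else 0 for p in row] for row in image]


def projection_profiles(binary):
    """Count text pixels per row (horizontal) and per column (vertical)."""
    horizontal = [sum(row) for row in binary]
    vertical = [sum(col) for col in zip(*binary)]
    return horizontal, vertical


# Toy 5x5 grayscale image: a dark "T" (value 30) on a light background (220).
toy_image = [
    [220, 220, 220, 220, 220],
    [220,  30,  30,  30, 220],
    [220, 220,  30, 220, 220],
    [220, 220,  30, 220, 220],
    [220, 220, 220, 220, 220],
]

toy_binary = binarize(toy_image)
toy_horizontal, toy_vertical = projection_profiles(toy_binary)
print(toy_horizontal)  # [0, 3, 1, 1, 0]
print(toy_vertical)    # [0, 1, 3, 1, 0]
```

In this toy case the two profiles already separate the horizontal bar of the "T" (a single row with three text pixels) from its stem (a single column with three text pixels), which gives an intuition for why projections can serve as simple structural features.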
Document type :
Theses

https://pastel.archives-ouvertes.fr/tel-01548457
Contributor : Abes Star
Submitted on : Tuesday, June 27, 2017 - 4:16:09 PM
Last modification on : Thursday, July 5, 2018 - 2:29:13 PM
Long-term archiving on : Wednesday, January 17, 2018 - 9:07:05 PM

File

TH2016PESC1069_diffusion.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-01548457, version 1

Citation

Mahmoud Soua. Extraction hybride et description structurelle de caractères pour une reconnaissance efficace de texte dans les documents hétérogènes scannés : Méthodes et Algorithmes parallèles. Informatique et langage [cs.CL]. Université Paris-Est, 2016. Français. ⟨NNT : 2016PESC1069⟩. ⟨tel-01548457⟩
