
Deriving semantic objects from the structured web

Abstract : This thesis focuses on the extraction and analysis of Web data objects, investigated from three points of view: temporal, structural, and semantic. We first survey strategies and best practices for deriving temporal aspects of Web pages, together with a more in-depth study of Web feeds for this purpose, as well as other statistics. Next, in the context of Web pages dynamically generated by content management systems, we present two keyword-based techniques that perform article extraction from such pages. Automatically acquired keywords guide the process of object identification, either at the level of a single Web page (SIGFEED) or across different pages sharing the same template (FOREST). Finally, in the context of the deep Web, we present a generic framework that aims to discover the semantic model of a Web object (here, a data record) in two steps. First, FOREST is used to extract the objects. Second, the implicit rdf:type similarities between the object attributes and the entity of the form are represented as relationships that, together with the instances extracted from the objects, form a labeled graph. This graph is then aligned to an ontology such as YAGO in order to discover the unknown types and relations.
Keywords : Deep web

Cited literature : 134 references
Contributor : ABES STAR
Submitted on : Friday, March 6, 2015 - 6:37:32 AM
Last modification on : Friday, July 31, 2020 - 10:44:08 AM
Long-term archiving on : Sunday, June 7, 2015 - 3:25:49 PM


Version validated by the jury (STAR)


  • HAL Id : tel-01124278, version 1



Marilena Oita. Deriving semantic objects from the structured web. Other [cs.OH]. Télécom ParisTech, 2012. English. ⟨NNT : 2012ENST0060⟩. ⟨tel-01124278⟩


