
Deriving semantic objects from the structured web

Abstract : This thesis focuses on the extraction and analysis of Web data objects, investigated from three points of view: temporal, structural, and semantic. We first survey strategies and best practices for deriving temporal aspects of Web pages, together with a more in-depth study of Web feeds for this purpose, along with other statistics. Next, in the context of Web pages dynamically generated by content management systems, we present two keyword-based techniques that perform article extraction from such pages. Automatically acquired keywords guide the process of object identification, either at the level of a single Web page (SIGFEED) or across different pages sharing the same template (FOREST). Finally, in the context of the deep Web, we present a generic framework that aims at discovering the semantic model of a Web object (here, a data record) by, first, using FOREST to extract objects and, second, representing the implicit rdf:type similarities between the object attributes and the entity of the form as relationships that, together with the instances extracted from the objects, form a labeled graph. This graph is then aligned to an ontology such as YAGO to discover the unknown types and relations.
Keywords : Deep web

Cited literature [134 references]

Contributor : Abes Star
Submitted on : Friday, March 6, 2015 - 6:37:32 AM
Last modification on : Friday, July 31, 2020 - 10:44:08 AM
Long-term archiving on : Sunday, June 7, 2015 - 3:25:49 PM


Version validated by the jury (STAR)


  • HAL Id : tel-01124278, version 1



Marilena Oita. Deriving semantic objects from the structured web. Other [cs.OH]. Télécom ParisTech, 2012. English. ⟨NNT : 2012ENST0060⟩. ⟨tel-01124278⟩


