
Deriving semantic objects from the structured web

Abstract : This thesis focuses on the extraction and analysis of Web data objects, investigated from three points of view: temporal, structural, and semantic. We first survey strategies and best practices for deriving temporal aspects of Web pages, together with a more in-depth study of Web feeds for this particular purpose, and other statistics. Next, in the context of Web pages dynamically generated by content management systems, we present two keyword-based techniques that perform article extraction from such pages. Automatically acquired keywords guide the process of object identification, either at the level of a single Web page (SIGFEED) or across different pages sharing the same template (FOREST). Finally, in the context of the deep Web, we present a generic framework that aims at discovering the semantic model of a Web object (here, a data record) in two steps: first, FOREST is used to extract the objects; second, the implicit rdf:type similarities between the object attributes and the entity of the form are represented as relationships that, together with the instances extracted from the objects, form a labeled graph. This graph is then aligned to an ontology such as YAGO in order to discover the unknown types and relations.
Keywords : Deep web
Document type : Theses

Cited literature [134 references]

https://pastel.archives-ouvertes.fr/tel-01124278
Contributor : Abes Star
Submitted on : Friday, March 6, 2015 - 6:37:32 AM
Last modification on : Friday, July 31, 2020 - 10:44:08 AM
Long-term archiving on : Sunday, June 7, 2015 - 3:25:49 PM

File

2012ENST0060.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-01124278, version 1

Citation

Marilena Oita. Deriving semantic objects from the structured web. Other [cs.OH]. Télécom ParisTech, 2012. English. ⟨NNT : 2012ENST0060⟩. ⟨tel-01124278⟩

Metrics

Record views : 207
Files downloads : 386