Strigil: A Framework for Data Extraction in Semi-Structured Web Documents

Publication at Faculty of Mathematics and Physics

2013

Abstract

In this paper we introduce Strigil, a framework for automated data extraction. It is an easily configurable tool for retrieving data from textual or weakly structured documents.

The paper describes the framework architecture and its main components. Additionally, we propose a scraping language, inspired by XSL transformations, designed to extract data from different kinds of documents.

Although there are many different approaches focusing on various aspects of data scraping, they are usually highly specialized for a particular domain or data source. We compare these solutions and discuss their advantages and disadvantages.

Our scraping language is designed to work with an ontology to map scraped data directly to classes and attributes.
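
To make the idea of template-driven extraction mapped to an ontology more concrete, the following is a minimal sketch in Python. It is not Strigil's scraping language or its API: the template layout, the `scrape` function, the `ex:` class and property names, and the sample document are all hypothetical, chosen only to illustrate how a declarative rule might select nodes and map their contents to ontology classes and attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed input document (not taken from the paper).
DOCUMENT = """
<html><body>
  <div class="contract">
    <h2>Road repair</h2>
    <span class="price">125000</span>
  </div>
  <div class="contract">
    <h2>School roof</h2>
    <span class="price">48000</span>
  </div>
</body></html>
"""

# Hypothetical template: a rule selects nodes, assigns them an ontology
# class, and maps relative paths to ontology properties. The "ex:" names
# are illustrative assumptions, not terms from a real ontology.
TEMPLATE = {
    "select": ".//div[@class='contract']",
    "class": "ex:Contract",
    "properties": {
        "ex:title": "h2",
        "ex:price": "span[@class='price']",
    },
}


def scrape(document, template):
    """Apply one template rule and return one record per matched node."""
    root = ET.fromstring(document.strip())
    records = []
    for node in root.findall(template["select"]):
        # Each matched node becomes an instance of the ontology class.
        record = {"@type": template["class"]}
        for prop, path in template["properties"].items():
            match = node.find(path)
            # Skip properties whose path matches nothing or has no text.
            if match is not None and match.text:
                record[prop] = match.text.strip()
        records.append(record)
    return records


records = scrape(DOCUMENT, TEMPLATE)
```

Keeping the rule as data rather than code mirrors the declarative style of XSL transformations: the same `scrape` routine could be reused with a different template for a different source, under the assumption that the input can be parsed as well-formed XML.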

Keywords

Strigil, Framework, Data Extraction, Semi-Structured Data, Web