Victor

Publication

Abstract

Victor is a tool for cleaning web pages. It employs a sequence-labeling approach based on Conditional Random Fields (CRF).

Every block of text in the analyzed web page is assigned a set of features extracted from the textual content and HTML structure of the page. Text blocks are automatically labeled either as content segments containing main web page content, which should be preserved, or as noisy segments not suitable for further linguistic processing, which should be eliminated.
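The per-block feature extraction described above can be illustrated with a minimal sketch. It is not Victor's implementation: the function name `extract_block_features`, the set of block-level tags, and the three features (tag name, word count, link density) are assumptions chosen for illustration, and the abstract does not list the tool's actual feature set. A CRF labeler would take the resulting sequence of feature dictionaries, in document order, and assign each block a content or noise label.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit "text blocks"; the real tool may segment differently.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockExtractor(HTMLParser):
    """Splits a page into text blocks and computes simple features for each."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks, one feature dict per block
        self._tag = None       # tag of the block currently being read
        self._text = []        # text fragments collected for the current block
        self._link_words = 0   # words seen inside <a> within the current block
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            # A new block starts: close whatever block was open.
            self._flush()
            self._tag = tag
        elif tag == "a":
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Simplification: text outside any open block tag is ignored.
        if self._tag is None:
            return
        self._text.append(data)
        if self._in_link:
            self._link_words += len(data.split())

    def _flush(self):
        """Emit the current block (if it has text) and reset the state."""
        text = " ".join("".join(self._text).split())
        if text:
            words = len(text.split())
            self.blocks.append({
                "tag": self._tag,                                   # HTML structure feature
                "word_count": words,                                # textual feature
                "link_density": min(1.0, self._link_words / words), # share of linked words
            })
        self._tag = None
        self._text = []
        self._link_words = 0


def extract_block_features(html):
    """Return one feature dict per text block, in document order."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit a trailing block that was never closed
    return parser.blocks


if __name__ == "__main__":
    page = (
        "<div><a href='/'>Home</a> <a href='/about'>About</a></div>"
        "<p>Victor cleans web pages before further linguistic processing.</p>"
    )
    for block in extract_block_features(page):
        print(block)
```

In this sketch a navigation bar made up only of links gets a link density of 1.0, while a running-text paragraph gets 0.0, which is the kind of signal a sequence labeler could use to separate noisy segments from main content.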

Keywords