
Efficiency increase in HTML files compression by their proper preprocessing

Publication at Faculty of Mathematics and Physics | 2007

Abstract

Web search engines need to store huge amounts of data, much of it in HTML format. It is therefore useful to reduce its size by compression, but fast access to the documents must be preserved (decompression must not take too long).

A further limitation is that many HTML pages do not conform to any of the HTML standards, so it is often impossible to exploit knowledge of the HTML format. We have therefore decided to improve the compression ratio of the existing applications gzip and bzip2 by preprocessing the documents to be compressed.

The preprocessing is based on substituting shorter symbols for the most frequent tags and attributes. This speeds up the initialization of the gzip dictionary and simplifies the input of bzip2.
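
The substitution idea can be illustrated with a minimal Python sketch. It is not the implementation from the publication: the table `SUBSTITUTIONS`, the choice of control bytes as replacement symbols, and the function names `preprocess`, `restore` and `compressed_sizes` are all assumptions made for illustration, since the abstract does not list the actual tags or symbols used.

```python
import bz2
import gzip

# Hypothetical substitution table (assumption): the abstract names neither the
# tags nor the replacement symbols. Control bytes 0x01-0x07 are assumed to be
# absent from the input so that the mapping can be reversed.
SUBSTITUTIONS = [
    (b"</table>", b"\x01"),
    (b"<table", b"\x02"),
    (b"</td>", b"\x03"),
    (b"<td", b"\x04"),
    (b"</tr>", b"\x05"),
    (b"<tr", b"\x06"),
    (b' href="', b"\x07"),
]


def preprocess(html: bytes) -> bytes:
    """Replace frequent tags and attributes with one-byte symbols."""
    for _, symbol in SUBSTITUTIONS:
        if symbol in html:
            raise ValueError("input already contains a reserved symbol")
    for pattern, symbol in SUBSTITUTIONS:
        html = html.replace(pattern, symbol)
    return html


def restore(data: bytes) -> bytes:
    """Undo preprocess() by applying the substitutions in reverse order."""
    for pattern, symbol in reversed(SUBSTITUTIONS):
        data = data.replace(symbol, pattern)
    return data


def compressed_sizes(data: bytes) -> dict:
    """Return the compressed size in bytes under gzip and bzip2."""
    return {
        "gzip": len(gzip.compress(data)),
        "bzip2": len(bz2.compress(data)),
    }
```

Calling `compressed_sizes(page)` and `compressed_sizes(preprocess(page))` on the same document gives the two figures to compare. Because the substitutions do not depend on parsing the page, the sketch also works on documents that are not valid HTML.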

In almost all cases, the compressed files are smaller when preprocessing is used than without it. The only exception is very small files.