The Czech Insolvency Register currently covers about 200,000 insolvency proceedings commenced since 2008. Several scanned document copies can be attached to each proceeding, amounting to approximately 1,200,000 PDF files in total.
This study aims to identify efficient pre-processing, clustering, and classification techniques capable of extracting valid information on the structure of indebtedness across Czech society from these PDF files.