UMC005: English-Urdu Parallel Corpus



The English-Urdu Parallel Corpus is intended for training statistical machine translation systems between these two languages. It consists of four parts:

1. English-Urdu part of the EMILLE corpus;

2. texts from the Wall Street Journal (Penn Treebank);

3. translations of the Quran;

4. translations of the Bible.

The parallel data that existed before (EMILLE) have been completely and newly cleaned by hand, with corrections to the alignment and to many sentences on the Urdu side.