hrenWaC | Natural Language Processing group

hrenWaC – Croatian-English Parallel Web Corpus

The corpus consists of 99,001 Croatian–English sentence pairs. It is published under the CC-BY-SA license.

The parallel document pair candidates were automatically extracted from the hrWaC corpus by Nikola Ljubešić. These candidates were manually checked by Daša Berović and Danijela Merkler under the supervision of Marko Tadić.

TMX file

TXT / Moses file
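
The two download formats above can be read with a few lines of Python. The following is only a minimal sketch under assumptions not stated on this page: that the TMX file follows the usual `tu` / `tuv` / `seg` layout with language codes `hr` and `en`, and that the Moses-style text is line-aligned (line i on the Croatian side translates line i on the English side). The function names and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang="hr", tgt_lang="en"):
    """Yield (source, target) segment pairs from a TMX path or file object.

    Assumes the common TMX layout: <tu> units holding one <tuv> per language.
    """
    tree = ET.parse(source)
    for tu in tree.getroot().iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary language subtag, e.g. "en-GB" -> "en".
                segs[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]


def read_moses_pairs(src_lines, tgt_lines):
    """Pair up line-aligned text: line i of one side translates line i of the other."""
    src = [line.rstrip("\n") for line in src_lines]
    tgt = [line.rstrip("\n") for line in tgt_lines]
    if len(src) != len(tgt):
        raise ValueError("source and target sides differ in length")
    return list(zip(src, tgt))


# Hypothetical in-memory sample, not taken from the corpus itself.
SAMPLE_TMX = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<tmx version="1.4"><header/><body>\n'
    '<tu><tuv xml:lang="hr"><seg>Dobar dan.</seg></tuv>'
    '<tuv xml:lang="en"><seg>Good day.</seg></tuv></tu>\n'
    "</body></tmx>"
).encode("utf-8")

tmx_pairs = list(read_tmx_pairs(io.BytesIO(SAMPLE_TMX)))
moses_pairs = read_moses_pairs(["Hvala.\n"], ["Thank you.\n"])
```

With the real downloads, `read_tmx_pairs` would be given the path of the TMX file, and `read_moses_pairs` two open file objects, one per language, provided the text release is indeed split that way.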

In 2015 we developed the Spidextor tool (https://github.com/abumatran/spidextor), which automates the production of bitext collections from top-level-domain (TLD) crawls. We used it to collect the second version of the corpus. This version (1) is not manually checked and (2) does not (explicitly) contain the texts from the previous version. It can be downloaded from http://hdl.handle.net/11356/1058.