Web corpora of the Bosnian, Croatian and Serbian top-level domains published

We have just finished a new version of the Croatian top-level-domain (TLD) web corpus, hrWaC v2.0, and the first versions of the Bosnian and Serbian TLD web corpora, bsWaC v1.0 and srWaC v1.0. The corpora are 1.9 billion, 429 million and 894 million tokens in size, respectively. They are annotated with lemmas and morphosyntactic descriptions; the dependency syntax layer will follow shortly (40 server-grade cores melting). Additionally, they contain document-level metadata that discriminate between the three similar languages and assess text quality on the lexical and syntactic levels. The corpora are published under the CC-BY-SA license. Please contact Nikola for a download link to the corpora.