We have just finished a new version of the Croatian top-level-domain web corpus, hrWaC v2.0, and the first versions of the Bosnian and Serbian TLD web corpora, bsWaC v1.0 and srWaC v1.0. The corpora are 1.9 billion, 429 million and 894 million tokens in size, respectively. They are annotated with lemmas and morphosyntactic descriptions, while the dependency syntax layer will follow shortly (40 server-grade cores melting). Additionally, they contain document-level metadata that discriminate between the three similar languages and assess the text quality on the lexical and syntactic levels. The corpora are published under the CC-BY-SA license. Please contact Nikola for a download link to the corpora.
The prepared dataset is compiled from a dump of the Croatian Wikipedia. Each worker task consists of a token, its assumed morphosyntactic description and its context, taken from cases in which the HunPos and TreeTagger tools, both trained on the same dataset, disagree. With crowdsourcing we hope to filter out the correctly annotated tokens before the final check of the tagger disagreements, which will be performed by a language professional.
You will need a Google account to register with the tool.
The results of the crowdsourcing effort will be freely available under the CC-BY-SA license, like all our other data.
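The way the crowdsourcing tasks are selected, namely tokens on which two taggers disagree, can be illustrated with a minimal Python sketch. It is not the code used to build the dataset: the function name `disagreement_tasks`, the `Task` record, the context window size, and the example tags are all assumptions made for illustration, and the two tag lists are hand-written stand-ins rather than real HunPos or TreeTagger output.

```python
from typing import List, NamedTuple


class Task(NamedTuple):
    """One hypothetical crowdsourcing task: a token the two taggers disagree on."""
    index: int
    token: str
    tag_a: str
    tag_b: str
    context: str


def disagreement_tasks(tokens: List[str], tags_a: List[str],
                       tags_b: List[str], window: int = 2) -> List[Task]:
    """Return a task for every position where the two tag sequences differ."""
    if not (len(tokens) == len(tags_a) == len(tags_b)):
        raise ValueError("tokens and both tag sequences must be aligned")
    tasks = []
    for i, (token, a, b) in enumerate(zip(tokens, tags_a, tags_b)):
        if a != b:
            # Keep up to `window` tokens on each side as context for the worker.
            left = max(0, i - window)
            context = " ".join(tokens[left:i + window + 1])
            tasks.append(Task(i, token, a, b, context))
    return tasks


# Illustrative input only: the tags below are made up, not real tagger output.
tokens = ["Zagreb", "je", "glavni", "grad", "Hrvatske", "."]
hunpos_tags = ["Npmsn", "Var3s", "Agpmsny", "Ncmsn", "Npfsg", "Z"]
treetagger_tags = ["Npmsn", "Var3s", "Agpmsay", "Ncmsn", "Npfsg", "Z"]

tasks = disagreement_tasks(tokens, hunpos_tags, treetagger_tags)
for task in tasks:
    print(task.token, task.tag_a, task.tag_b, "|", task.context)
```

Tokens on which both taggers agree never become tasks, so workers and the language professional only see the disputed cases.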
The Abu-MaTran project consortium, in which Nikola is the local coordinator, recently published its first milestone: an English-Croatian translator at http://translator.abumatran.eu. This is the baseline system on top of which the consortium will make advances in the next 3.5 years of the project.
We have just published SETimes.HR, a multi-level annotated corpus of Croatian consisting of ~8,000 sentences. It contains ~4,000 sentences manually lemmatized and morphosyntactically tagged and ~2,500 sentences annotated for syntactic dependencies, and the entire corpus is annotated with three ENAMEX classes of named entities. The corpus is published under the permissive CC-BY-SA-3.0 license.