New API release

For some time now we have been working on the new generation of our language technologies for the three main western South Slavic languages: Slovene, Croatian and Serbian. We have recently released a new API that currently covers the inflectional lexicon, segmenter, morphosyntactic tagger and lemmatiser on both the web GUI and the API side, while the diacritic restorer is available through the API only. A dependency parser, named entity recogniser and non-standard text normaliser will follow. You can register and use the web GUI at http://nl.ijs.si/services, while the easiest way to use the API is through the library available from https://github.com/uzh/reldi/tree/master/tools/lib.
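As a rough sketch of how a client might talk to such a service: the snippet below builds a JSON request body for a tagging call. The endpoint path, field names and language codes are assumptions for illustration, not the documented interface; the library linked above is the authoritative way to use the API.

```python
import json

# Assumed endpoint path -- illustrative only, see the reldi library
# for the actual interface.
API_URL = "http://nl.ijs.si/services/api/tag"

def build_tag_request(text, lang):
    """Build a JSON request body asking the service to tag `text`.

    `lang` is assumed to be one of the ISO codes for the three
    supported languages: 'sl', 'hr' or 'sr'.
    """
    if lang not in ("sl", "hr", "sr"):
        raise ValueError("supported languages: sl, hr, sr")
    return json.dumps({"text": text, "lang": lang})

body = build_tag_request("Danes je lep dan.", "sl")
```

The body would then be POSTed to the endpoint with any HTTP client; authentication details depend on the registered account and are omitted here.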

API for our language technologies

We have recently finished developing the first version of our API, which lets you use our language technologies (tokenizer, sentence splitter, morphosyntactic tagger, lemmatizer, named entity recognizer, dependency parser), and a web-based GUI that lets you use the API without any programming skills. You can find the GUI, as well as basic documentation on how to use the API, at http://faust.ffzg.hr/nlpws/gui/.
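Taggers and lemmatizers of this kind typically return one token per line with tab-separated annotation columns and a blank line between sentences. Assuming a token, lemma, MSD column order (an assumption about the output format, not its specification), such output can be parsed like this:

```python
def parse_tagged(output):
    """Parse tab-separated tagger output into sentences.

    Assumed layout: one token per line as token<TAB>lemma<TAB>MSD,
    blank line between sentences. The column order is an assumption
    about the service's output, not its documented format.
    """
    sentences, current = [], []
    for line in output.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        token, lemma, msd = line.split("\t")
        current.append({"token": token, "lemma": lemma, "msd": msd})
    if current:
        sentences.append(current)
    return sentences

sample = "Danes\tdanes\tRgp\nje\tbiti\tVa-r3s-n\n\nLep\tlep\tAgpmsnn"
parsed = parse_tagged(sample)
```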

Web corpora of Bosnian, Croatian and Serbian top-level domain published

We have just finished a new version of the Croatian top-level-domain web corpus, hrWaC v2.0, and the first versions of the Bosnian and Serbian TLD web corpora, bsWaC v1.0 and srWaC v1.0. The corpora are 1.9 billion, 429 million and 894 million tokens in size, respectively. They are annotated with lemmas and morphosyntactic descriptions, while the dependency-syntax layer will follow shortly (40 server-grade cores melting). Additionally, they contain document-level metadata that discriminates between the three similar languages and assesses text quality on the lexical and syntactic level. The corpora are published under the CC-BY-SA license. Please contact Nikola for a download link to the corpora.
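Web corpora of this kind are commonly distributed in the vertical format: one token per line with tab-separated lemma and MSD columns, and documents wrapped in `<doc ...>` elements whose attributes carry the document-level metadata. The reader below is a minimal sketch under that assumption; the attribute names in the sample are illustrative, not the corpora's actual metadata fields.

```python
import re

def read_vertical(lines):
    """Yield (doc_attrs, tokens) pairs from vertical-format lines.

    Assumed layout: documents delimited by <doc ...> / </doc>, token
    lines as token<TAB>lemma<TAB>MSD. Attribute names are whatever
    the corpus uses; those in the sample below are illustrative.
    """
    doc_attrs, tokens = {}, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("<doc"):
            doc_attrs = dict(re.findall(r'(\w+)="([^"]*)"', line))
            tokens = []
        elif line.startswith("</doc"):
            yield doc_attrs, tokens
        elif line and not line.startswith("<"):
            tokens.append(tuple(line.split("\t")))

sample = [
    '<doc id="1" lang="hr">',
    "Ovo\tovaj\tPd-nsn",
    "je\tbiti\tVa-r3s-n",
    "</doc>",
]
docs = list(read_vertical(sample))
```

Filtering by the language-discrimination and text-quality metadata then amounts to checking the attributes in `doc_attrs` before keeping a document.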

Crowdsourcing speakers of Croatian for improving basic language tools

If you happen to speak Croatian, feel free to join our crowdsourcing efforts, through which we are trying to produce more annotated data and improve our existing models for basic Croatian language tools.

The dataset was compiled from a dump of the Croatian Wikipedia. Each worker task consists of a token, its assumed morphosyntactic description and its context, for the cases where the HunPos and TreeTagger tools, both trained on the same dataset, disagree. With crowdsourcing we hope to eliminate the correctly annotated tokens from the final check of the tagger disagreements, which will be performed by a language professional.
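The selection of disagreement candidates can be sketched as follows: run both taggers over the same token sequence and keep every position where their tags differ, together with a small context window for the worker. Tagger invocation itself is elided, and the tag values are illustrative.

```python
def disagreements(tokens, tags_a, tags_b, window=2):
    """Yield (token, tag_a, tag_b, context) for each position where
    the two taggers assign different tags; `context` is a window of
    surrounding tokens shown to the crowd worker."""
    assert len(tokens) == len(tags_a) == len(tags_b)
    for i, (tok, a, b) in enumerate(zip(tokens, tags_a, tags_b)):
        if a != b:
            context = tokens[max(0, i - window): i + window + 1]
            yield tok, a, b, " ".join(context)

# Illustrative tag sequences -- not actual HunPos/TreeTagger output.
tokens = ["Ovo", "je", "lijep", "dan", "."]
hunpos = ["Pd-nsn", "Va-r3s-n", "Agpmsnn", "Ncmsn", "Z"]
treetagger = ["Pd-nsn", "Va-r3s-n", "Agpmsay", "Ncmsn", "Z"]
candidates = list(disagreements(tokens, hunpos, treetagger))
```

Tokens where the taggers agree are assumed correct and never reach the workers, which is what keeps the final professional check small.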

You will need a Google account to register with the tool.

The results of the crowdsourcing effort will be freely available under the CC-BY-SA license, like all our other data.