Morphosyntactic tagging and lemmatization

We have recently published an API with significantly improved processing accuracy. Documentation for the API is available here, and a web application for testing the API is available here.

This page provides lemmatization and morphosyntactic tagging models for Croatian, built with the CST lemmatizer and the HunPos tagger. The same models were used in an experiment on lemmatization and morphosyntactic tagging of both Croatian and Serbian. Download the models:

lemmatization, default CST settings (cc-by-sa.cst.tgz)
tagging, default HunPos settings (cc-by-sa.hunpos)
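
As a rough illustration of how the tagging model might be used, the sketch below wraps the HunPos command-line tagger from Python. It assumes that the `hunpos-tag` binary is installed and on the PATH, that it takes the model file as its only argument, reads one token per line from standard input with a blank line between sentences, and writes tab-separated token and tag pairs. It also assumes the model expects UTF-8 text. The helper names (`to_hunpos_input`, `parse_hunpos_output`, `tag_sentences`) are hypothetical and are not part of HunPos or of the models themselves.

```python
import subprocess
from typing import List, Tuple


def to_hunpos_input(sentences: List[List[str]]) -> str:
    """Serialize tokenized sentences: one token per line, blank line after each sentence."""
    return "".join("\n".join(tokens) + "\n\n" for tokens in sentences)


def parse_hunpos_output(text: str) -> List[List[Tuple[str, str]]]:
    """Parse tab-separated 'token<TAB>tag' lines; blank lines separate sentences."""
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"unexpected tagger output line: {line!r}")
        current.append((fields[0], fields[1]))
    if current:
        sentences.append(current)
    return sentences


def tag_sentences(
    sentences: List[List[str]],
    model_path: str = "cc-by-sa.hunpos",  # the downloaded model file
    binary: str = "hunpos-tag",           # assumed to be on the PATH
) -> List[List[Tuple[str, str]]]:
    """Run the external tagger on already tokenized sentences."""
    result = subprocess.run(
        [binary, model_path],
        input=to_hunpos_input(sentences),
        capture_output=True,
        encoding="utf-8",  # assumption: the model was trained on UTF-8 text
        check=True,
    )
    return parse_hunpos_output(result.stdout)
```

Calling `tag_sentences([["Ovo", "je", "rečenica", "."]])` would then return one list of (token, tag) pairs per input sentence. Tokenization is left to the caller, since the tagger operates on pre-tokenized text.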

The models are built from SETimes.HR, a manually lemmatized and morphosyntactically tagged corpus of Croatian of about 90,000 words (90 kw). Observed lemmatization accuracy is approximately 98%. Tagging accuracy is 87% for full morphosyntactic tags (revised MULTEXT-East version 4 tagset) and 97% when only the part of speech (POS) is considered. The data is provided under the CC-BY-SA-3.0 license.
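
To make the difference between the two tagging figures concrete, here is a minimal sketch of how full-tag and POS-only accuracy can be computed from aligned gold and predicted tags. It relies on the fact that the first character of a MULTEXT-East morphosyntactic tag encodes the part-of-speech category. The function name `tagging_accuracy` and the example tags are illustrative assumptions; the exact evaluation procedure behind the reported figures is described in the paper cited below, not here.

```python
from typing import Sequence, Tuple


def tagging_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> Tuple[float, float]:
    """Return (full-tag accuracy, POS-only accuracy) for aligned tag sequences."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    # Full accuracy: the whole morphosyntactic tag must match.
    full_hits = sum(g == p for g, p in zip(gold, predicted))
    # POS-only accuracy: only the first character (the category) must match.
    pos_hits = sum(g[:1] == p[:1] for g, p in zip(gold, predicted))
    return full_hits / len(gold), pos_hits / len(gold)


# Illustrative tags only: the first prediction gets the case wrong
# (nominative vs. accusative) but the category (N, noun) right.
gold = ["Ncmsn", "Vmr3s", "Agpfsay", "Z"]
pred = ["Ncmsa", "Vmr3s", "Agpfsay", "Z"]
full_acc, pos_acc = tagging_accuracy(gold, pred)
print(full_acc, pos_acc)  # 0.75 1.0
```

A single wrong attribute such as case or gender counts as an error for the full tag but not for POS, which is why the POS-only figure is considerably higher than the full-tag figure.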

Please cite this paper when using the models:

Agić, Ž.; Ljubešić, N.; Merkler, D. (2013). Lemmatization and Morphosyntactic Tagging of Croatian and Serbian. In Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing (BSNLP 2013). Sofia, Bulgaria: Association for Computational Linguistics, pp. 48–57.