Morphosyntactic tagging and lemmatization
Lemmatization and morphosyntactic tagging models for Croatian, built with the CST lemmatizer and the HunPos tagger. The same models were used in an experiment on lemmatization and morphosyntactic tagging of both Croatian and Serbian. Download the models:
- lemmatization, default CST settings (cc-by-sa.cst.tgz)
- tagging, default HunPos settings (cc-by-sa.hunpos)
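As a rough illustration of how the tagging model might be applied, the Python sketch below serializes tokenized sentences into the one-token-per-line layout that HunPos reads (a blank line closes each sentence) and parses the tab-separated token/tag lines it writes. The helper names, the assumption that a `hunpos-tag` binary is on the PATH, the UTF-8 encoding, and the location of the downloaded `cc-by-sa.hunpos` file are all assumptions made for this example, not part of the original distribution.

```python
import subprocess


def to_hunpos_input(sentences):
    """Serialize tokenized sentences: one token per line, blank line after each sentence."""
    lines = []
    for sentence in sentences:
        lines.extend(sentence)
        lines.append("")  # an empty line marks the sentence boundary
    return "\n".join(lines) + "\n"


def parse_hunpos_output(text):
    """Parse tab-separated token/tag lines into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        token, _, tag = line.partition("\t")
        current.append((token, tag.strip()))
    if current:
        sentences.append(current)
    return sentences


def tag_sentences(sentences, model_path="cc-by-sa.hunpos", binary="hunpos-tag"):
    """Run the tagger on tokenized sentences (hypothetical wrapper; paths are assumptions)."""
    result = subprocess.run(
        [binary, model_path],
        input=to_hunpos_input(sentences),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return parse_hunpos_output(result.stdout)
```

Calling `tag_sentences([["Ovo", "je", "test", "."]])` would then return one list of (token, tag) pairs per input sentence, provided the binary and model are available. Tokenization is left to the caller, since the models expect already tokenized text.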
The models were built from SETimes.HR, a manually lemmatized and morphosyntactically tagged 90,000-word (90 kw) corpus of Croatian. Lemmatization accuracy was approximately 98%. Tagging accuracy was 87% on the full revised Multext East version 4 morphosyntactic tagset and 97% on POS only. The data is provided under the CC-BY-SA-3.0 license.
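The difference between the full-tag and POS-only figures can be illustrated with a minimal sketch. It assumes token-level accuracy (the share of tokens whose predicted tag matches the gold tag) and relies on the Multext East convention that the first character of a morphosyntactic tag encodes the part of speech. The function names and the toy tags are made up for illustration and are not taken from the paper's evaluation.

```python
def pos_of(msd):
    """Reduce a Multext East morphosyntactic tag to its part-of-speech letter."""
    return msd[:1]


def accuracy(gold, predicted, key=lambda tag: tag):
    """Token-level accuracy; `key` maps each tag to the value that is compared."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(key(g) == key(p) for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy data (invented): the first prediction gets the case wrong but the POS right.
gold = ["Ncmsn", "Vmr3s", "Agpmsny"]
predicted = ["Ncmsa", "Vmr3s", "Agpmsny"]

full_accuracy = accuracy(gold, predicted)             # compares complete tags
pos_accuracy = accuracy(gold, predicted, key=pos_of)  # compares first letters only
```

In this toy case the full-tag comparison counts the first token as an error while the POS-only comparison does not, which is why POS-only accuracy can never be lower than full-tag accuracy on the same output.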
Please cite this paper when using the models:
Agić, Ž.; Ljubešić, N.; Merkler, D. (2013). Lemmatization and Morphosyntactic Tagging of Croatian and Serbian. In Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing (BSNLP 2013). Sofia, Bulgaria: Association for Computational Linguistics, pp. 48–57.