SETimes.HR treebank

SETimes.HR Croatian dependency treebank

The corpus is based on the Croatian part of the SETimes parallel corpus. Corpus features:

Please note that the corpus is split into two parts. The first part is manually annotated on all annotation layers. The second part is manually annotated for named entities, and automatically lemmatized and MSD-tagged. The cutoff point is line 93 124 of the corpus file. Syntactically annotated sentences can be identified by a non-empty DEPREL column in the CoNLL-X format.
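As a rough illustration of the DEPREL check, the Python sketch below groups token lines into sentences and keeps those carrying a dependency label. It assumes the standard CoNLL-X layout: tab-separated columns, blank lines between sentences, DEPREL as the eighth column, and "_" as the empty value. The actual corpus file may differ (for example, extra columns for named entities), so the column index may need adjusting. The function names and the toy sample lines are hypothetical and are not taken from the corpus.

```python
from typing import Iterable, Iterator, List

# Zero-based index of DEPREL in the standard 10-column CoNLL-X layout
# (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL).
# Assumption: the corpus file follows this column order.
DEPREL = 7

Sentence = List[List[str]]


def read_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Group tab-separated token lines into sentences split on blank lines."""
    sentence: Sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(line.split("\t"))
    if sentence:
        yield sentence


def has_syntax(sentence: Sentence) -> bool:
    """True if at least one token has a non-empty DEPREL value."""
    return any(
        len(token) > DEPREL and token[DEPREL] not in ("", "_")
        for token in sentence
    )


# Toy input made up for this example; not real corpus content.
sample = [
    "1\tVlada\tvlada\tN\tNcfsn\t_\t2\tSb\t_\t_\n",
    "2\tradi\traditi\tV\tVmr3s\t_\t0\tPred\t_\t_\n",
    "\n",
    "1\tZagreb\tZagreb\tN\tNpmsn\t_\t_\t_\t_\t_\n",
]

annotated = [s for s in read_sentences(sample) if has_syntax(s)]
print(len(annotated))
```

To run this on the corpus itself, pass an open file handle (opened with encoding="utf-8") to read_sentences in place of the sample list.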

The corpus is provided under the CC-BY-SA-3.0 license. Please cite the following papers when using the models. They address the morphosyntactic, syntactic and named entity annotation layers of the corpus, respectively.

Agić, Ž.; Ljubešić, N.; Merkler, D. (2013) Lemmatization and Morphosyntactic Tagging of Croatian and Serbian. In Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing (BSNLP 2013). Sofia, Bulgaria, Association for Computational Linguistics, pp. 48–57.

Agić, Ž.; Merkler, D. (2013) Three Syntactic Formalisms for Data-Driven Dependency Parsing of Croatian. In Text, Speech and Dialogue. Lecture Notes in Computer Science. Berlin, Heidelberg, Springer, pp. 560–567.

Ljubešić, N.; Stupar, M.; Jurić, T.; Agić, Ž. (2013) Combining Available Datasets for Building Named Entity Recognition Models of Croatian and Slovene. Slovenščina 2.0: empirical, applied and interdisciplinary research, in press.

In these papers, we describe experiments with lemmatization and morphosyntactic tagging, dependency parsing, and named entity recognition using the SETimes.HR corpus. Follow the links to obtain the trained models and test sets.