SETimes.HR treebank | Natural Language Processing group

SETimes.HR Croatian dependency treebank

The corpus is based on the Croatian part of the SETimes parallel corpus. Corpus features:

- Croatian newspaper text from the Southeast European Times
- ~8 000 sentences, 180 000 word forms
- ~4 000 sentences manually lemmatized and morphosyntactically tagged
  - using the revised MULTEXT-East version 4 tagset
- ~2 500 sentences annotated for syntactic dependencies
  - using a new, simple syntactic tagset
- entire corpus manually annotated for named entities (three ENAMEX classes: PER, LOC, ORG)
- download the corpus in CoNLL-X format (setimes.hr.v1.conllx.tar.gz)
- the latest copy of the corpus can be obtained from GitHub at https://github.com/ffnlp/sethr

Please note that the corpus is split into two parts. The first part contains manual annotation on all annotation layers. The second part is manually annotated for named entities, but automatically lemmatized and MSD-tagged. The cutoff point is line 93 124 of the corpus file. Syntactically annotated sentences can be identified by looking for a non-empty CoNLL-X DEPREL column.
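The two checks described above (position relative to the cutoff line, and a non-empty DEPREL column) can be sketched in Python as follows. This is a minimal illustration, not part of the corpus distribution. It assumes the standard ten-column, tab-separated CoNLL-X layout with blank lines between sentences, assumes that an unannotated DEPREL cell is either blank or "_", and assumes that the cutoff line itself still belongs to the first part. The function names and the sample rows are made up for the example and are not taken from the corpus.

```python
import io

DEPREL = 7            # zero-based index of the DEPREL column in CoNLL-X
EMPTY = {"", "_"}     # assumption: unannotated cells are blank or "_"
CUTOFF_LINE = 93124   # cutoff line number stated in the text above


def read_sentences(lines):
    """Yield (start_line, rows) for each blank-line-separated sentence."""
    rows, start = [], None
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():          # a blank line closes the sentence
            if rows:
                yield start, rows
            rows, start = [], None
            continue
        if start is None:
            start = lineno            # remember where the sentence begins
        rows.append(line.split("\t"))
    if rows:                          # file may lack a trailing blank line
        yield start, rows


def has_dependencies(rows):
    """True if at least one token has a non-empty DEPREL value."""
    return any(len(r) > DEPREL and r[DEPREL] not in EMPTY for r in rows)


def in_first_part(start_line, cutoff=CUTOFF_LINE):
    """Assumption: the cutoff line is the last line of the first part."""
    return start_line <= cutoff


# Made-up sample with placeholder values, not real corpus content.
SAMPLE = (
    "1\tw1\tl1\tT\tT\t_\t2\tLABEL\t_\t_\n"
    "2\tw2\tl2\tT\tT\t_\t0\tLABEL\t_\t_\n"
    "\n"
    "1\tw3\tl3\tT\tT\t_\t_\t_\t_\t_\n"
    "2\tw4\tl4\tT\tT\t_\t_\t_\t_\t_\n"
)

for start, rows in read_sentences(io.StringIO(SAMPLE)):
    print(start, has_dependencies(rows), in_first_part(start))
```

To run this on the actual corpus, replace `io.StringIO(SAMPLE)` with a file object opened on the extracted CoNLL-X file, for example `open(path, encoding="utf-8")`, where `path` is wherever the archive was unpacked.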

The corpus is provided under the CC-BY-SA-3.0 license. Please cite the following papers when using the models. They address the morphosyntactic, syntactic and named entity annotation layers of the corpus, respectively.

Agić, Ž.; Ljubešić, N.; Merkler, D. (2013) Lemmatization and Morphosyntactic Tagging of Croatian and Serbian. In Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing (BSNLP 2013). Sofia, Bulgaria, Association for Computational Linguistics, 2013, pp. 48–57.

Agić, Ž.; Merkler, D. (2013) Three Syntactic Formalisms for Data-Driven Dependency Parsing of Croatian. In Text, Speech and Dialogue. Lecture Notes in Computer Science. Berlin, Heidelberg, Springer, 2013, pp. 560–567.

Ljubešić, N.; Stupar, M.; Jurić, T.; Agić, Ž. (2013) Combining Available Datasets for Building Named Entity Recognition Models of Croatian and Slovene. Slovenščina 2.0: empirical, applied and interdisciplinary research, in press.

In these papers, we describe experiments with lemmatization and morphosyntactic tagging, dependency parsing and named entity recognition using the SETimes.HR corpus. Follow the links to obtain the trained models and test sets.