CroLTeC | Natural Language Processing group

CroLTeC (CROatian Learner TExt Corpus) contains texts collected from learners of Croatian as a second and foreign language, from beginners (A1) to advanced learners (C1 and higher). The purpose of the CroLTeC corpus is to enable in-depth analysis and description of learner language, the extraction of important linguistic patterns, contrastive interlanguage analysis, and computer-aided error analysis.

CroLTeC consists of transcribed manuscripts that preserve the corrections made by the learners themselves (deletions, insertions and changes in word order). The texts are systematically described by detailed sociolinguistic metadata (gender, age, nationality, mother tongue, bilingual and multilingual competence, and parents’ language proficiency).

In addition, language instructors were asked to submit a report on the essay topic with each weekly essay. The data from these reports were used as essay meta tags: the level of linguistic competence required, the title of the essay, the number of the essay/week of learning, genre, scope, the conditions under which the essay was produced (time limit, size limit, etc.) and the circumstances under which it was produced (homework, part of an exam, part of field work, etc.).

The CroLTeC corpus currently contains 6,213 anonymized transcripts in XML format and 1,217 born-digital essays that were converted to XML.

It contains 1,054,287 words. The essays were collected from 755 learners with 36 different first-language backgrounds.

The corpus is available here: http://teitok.clul.ul.pt/croltec/.