Stemmer for Croatian | Natural Language Processing group

A simple rule-based stemmer for Croatian

Ivan Pandžić, a student of mine, and I have built a simple rule-based stemmer for Croatian, which we publish under the GNU Lesser General Public License. It is a refinement and redesign of the stemmer presented in my InFuture 2007 paper.

The stemmer applies a series of transformations (defined in transformations.txt) that handle morphonological changes and a series of rules (defined in rules.txt) that remove suffixes. Because we had information retrieval tasks in mind while building it, the stemmer generally works best on adjectives and nouns.
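
The two-stage idea can be illustrated with a minimal Python sketch. It is not the stemmer itself: the transformation pairs, the suffix list, the minimum stem length and the order of the two stages are all assumptions made up for this example, and they do not reproduce the contents or format of transformations.txt and rules.txt.

```python
# Hypothetical transformations as (word ending, replacement) pairs.
# They undo sibilarization; the real transformations.txt may look different.
TRANSFORMATIONS = [
    ("zi", "ga"),  # "knjizi" -> "knjiga"
    ("ci", "ka"),  # "ruci" -> "ruka"
]

# Hypothetical suffix rules, tried in order (longest first).
SUFFIXES = ["ama", "om", "a", "e", "i", "u"]

# Assumed lower bound on stem length, so that short words are left alone.
MIN_STEM_LENGTH = 3


def transform(word):
    """Apply the first matching ending replacement, if any."""
    for ending, replacement in TRANSFORMATIONS:
        if word.endswith(ending):
            return word[: -len(ending)] + replacement
    return word


def strip_suffix(word):
    """Remove the first matching suffix that leaves a long enough stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def stem(word):
    """Normalize the ending first, then strip the suffix (assumed order)."""
    return strip_suffix(transform(word.lower()))


for form in ["knjiga", "knjizi", "knjigama"]:
    print(form, "->", stem(form))
```

With these toy rules, the three forms of "knjiga" all map to the same stem, which is the behaviour an information retrieval system needs from a stemmer.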

We performed a basic evaluation of the stemmer, using a lemmatized newspaper corpus as the gold standard. For adjectives and nouns it achieved a precision of 0.986 and a recall of 0.961 (F1 0.973). Across all parts of speech it achieved a precision of 0.98 and a recall of 0.92 (F1 0.947).
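
For reference, F1 is the harmonic mean of precision and recall. The short Python snippet below recomputes it from the precision and recall reported for adjectives and nouns; the function name is ours and is not part of the stemmer.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported above for adjectives and nouns.
print(round(f1_score(0.986, 0.961), 3))  # 0.973
```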

Download the stemmer!