BS-HR-SR LID

Language identifier for Bosnian, Croatian and Serbian

The identifier is a Naive Bayes classifier trained on the Bosnian, Croatian and Serbian tritext from the SETimes corpus, with lowercased tokens as features. The code is published under the GNU Lesser General Public License. The tool and data can be downloaded here.
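To illustrate the idea (this is not the distributed code), a multinomial Naive Bayes classifier over lowercased tokens can be sketched roughly as below. The tokenizer, the add-one smoothing, the class name NaiveBayesLID and the two toy training sentences are all assumptions made for this example; the published tool and its models may differ on each point.

```python
import math
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return TOKEN_RE.findall(text.lower())


class NaiveBayesLID:
    """Hypothetical multinomial Naive Bayes language identifier."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # language -> token -> count
        self.doc_counts = Counter()               # language -> training documents
        self.vocab = set()                        # all tokens seen in training

    def train(self, text, lang):
        tokens = tokenize(text)
        self.token_counts[lang].update(tokens)
        self.doc_counts[lang] += 1
        self.vocab.update(tokens)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_lang, best_score = None, float("-inf")
        for lang in self.doc_counts:
            counts = self.token_counts[lang]
            # Add-one smoothing (an assumption): unseen tokens get a small mass.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[lang] / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


# Toy training data, invented for illustration only (not from SETimes).
model = NaiveBayesLID()
model.train("Tko je to rekao tijekom tjedna", "hr")
model.train("Ko je to rekao tokom nedelje", "sr")

print(model.classify("tko je"))    # hr
print(model.classify("ko je to"))  # sr
```

With real training data, the tokens shared by all three languages contribute about equally to every score, so the decision rests on the tokens that differ between them.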

I evaluated the model and compared it to a second-order Markov-chain character model built from the same data (the method is described in my ITI 2007 paper). The evaluation set consists of 100 manually checked documents for each language, retrieved from these Internet domains: http://www.dnevniavaz.ba, http://www.vecernji.hr and http://www.politika.rs (Cyrillic was transliterated to Latin). This data is part of the bsWaC, hrWaC and srWaC corpora. The evaluation set is distributed together with the tool and models. Please let me know how your results compare to mine.
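For readers unfamiliar with the baseline, a second-order Markov-chain character model estimates the probability of each character given the two characters before it, and the language whose model gives the text the highest probability wins. The sketch below shows only this general idea; the padding, the add-one smoothing, the names CharMarkovModel and identify, and the toy training strings are assumptions, and the actual method is the one described in the ITI 2007 paper.

```python
import math
from collections import Counter


class CharMarkovModel:
    """Hypothetical second-order character model: P(c | two preceding chars)."""

    def __init__(self):
        self.trigrams = Counter()  # three-character sequence -> count
        self.contexts = Counter()  # two-character context -> count
        self.alphabet = set()

    def train(self, text):
        padded = "  " + text.lower()  # two spaces give the first char a context
        self.alphabet.update(padded)
        for i in range(2, len(padded)):
            ctx = padded[i - 2:i]
            self.trigrams[ctx + padded[i]] += 1
            self.contexts[ctx] += 1

    def log_prob(self, text):
        padded = "  " + text.lower()
        # Add-one smoothing (an assumption); +1 reserves mass for unseen chars.
        v = len(self.alphabet) + 1
        total = 0.0
        for i in range(2, len(padded)):
            ctx = padded[i - 2:i]
            total += math.log(
                (self.trigrams[ctx + padded[i]] + 1) / (self.contexts[ctx] + v)
            )
        return total


def identify(text, models):
    """Return the language whose model assigns the text the highest log-probability."""
    return max(models, key=lambda lang: models[lang].log_prob(text))


# Toy training data, invented for illustration only.
models = {"hr": CharMarkovModel(), "sr": CharMarkovModel()}
models["hr"].train("tko je to rekao tijekom tjedna")
models["sr"].train("ko je to rekao tokom nedelje")

print(identify("tijekom", models))  # hr
print(identify("nedelje", models))  # sr
```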

The confusion table for the second-order Markov-chain character model, which reaches an accuracy of 90.3%, is:

     bs   hr   sr
bs  173   17   10
hr   30  170    0
sr    1    0  199

The confusion table for the Naive Bayes token model, which reaches an accuracy of 95.7%, is:

     bs   hr   sr
bs  181   11    8
hr    7  193    0
sr    0    0  200
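The reported accuracies can be rechecked from the two three-language tables: accuracy is the sum of the diagonal divided by the sum of all cells. The short sketch below does that; the function name accuracy and the list-of-rows representation are choices made for this example only.

```python
def accuracy(matrix):
    """Share of documents on the diagonal of a square confusion table."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Rows and columns in the order bs, hr, sr, copied from the tables above.
markov = [[173, 17, 10], [30, 170, 0], [1, 0, 199]]
naive_bayes = [[181, 11, 8], [7, 193, 0], [0, 0, 200]]

print(f"{accuracy(markov):.1%}")       # 90.3%
print(f"{accuracy(naive_bayes):.1%}")  # 95.7%
```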

Since distinguishing between Bosnian and Croatian is evidently the hard part (most errors in both tables above are confusions between these two languages), besides the BS-HR-SR model (bs-hr-sr.classifier) I also publish a model for discriminating between Croatian and Serbian only (hr-sr.classifier). The confusion tables for the Markov-chain and the Naive Bayes approaches are:

Markov chain:

     hr   sr
hr  200    0
sr    1  199

Naive Bayes:

     hr   sr
hr  200    0
sr    0  200