Language identifier for Bosnian, Croatian and Serbian
The identifier is a Naive Bayes classifier trained on the Bosnian, Croatian and Serbian tritext from the SETimes corpus, using lowercased tokens as features. The code is published under the GNU Lesser General Public License. Download the tool and data here.
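To illustrate the general idea of a Naive Bayes classifier over lowercased tokens, here is a minimal sketch. It is not the code of the published tool: the class name, the tokenizer regex, the add-one smoothing and the two toy training sentences are all assumptions made for this example.

```python
import math
import re
from collections import Counter, defaultdict

# Assumption: a token is any run of word characters; the real tool may tokenize differently.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


class NaiveBayesLanguageIdentifier:
    """Multinomial Naive Bayes with add-alpha smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.token_counts = defaultdict(Counter)  # language -> token -> count
        self.doc_counts = Counter()               # language -> number of documents
        self.vocabulary = set()

    def train(self, documents):
        """documents: iterable of (language, text) pairs."""
        for language, text in documents:
            self.doc_counts[language] += 1
            for token in tokenize(text):
                self.token_counts[language][token] += 1
                self.vocabulary.add(token)

    def classify(self, text):
        """Return the language with the highest posterior log-probability."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_language, best_score = None, -math.inf
        for language in self.doc_counts:
            counts = self.token_counts[language]
            total_tokens = sum(counts.values())
            score = math.log(self.doc_counts[language] / total_docs)  # log prior
            for token in tokens:
                score += math.log(
                    (counts[token] + self.alpha)
                    / (total_tokens + self.alpha * vocab_size)
                )
            if score > best_score:
                best_language, best_score = language, score
        return best_language


# Toy training data (hypothetical, far too small for real use).
toy_identifier = NaiveBayesLanguageIdentifier()
toy_identifier.train([
    ("hr", "Tko je kupio kruh i mlijeko prošli tjedan"),
    ("sr", "Ko je kupio hleb i mleko prošle nedelje"),
])
print(toy_identifier.classify("Gdje je kruh"))
print(toy_identifier.classify("hleb i mleko"))
```

With real data, each language would be trained on its side of the corpus and the trained counts stored in a model file.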
I evaluated the model against a second-order Markov-chain character model built from the same data (the method is described in my ITI 2007 paper). The evaluation set consists of 100 manually checked documents for each language, retrieved from these Internet domains: http://www.dnevniavaz.ba, http://www.vecernji.hr and http://www.politika.rs (Cyrillic was transliterated to Latin). These documents are part of the bsWaC, hrWaC and srWaC corpora. The evaluation set is distributed together with the tool and models. Please let me know how your results compare to mine.
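For readers unfamiliar with the baseline, a second-order Markov-chain character model predicts each character from the two characters before it, and a document is assigned to the language whose model gives it the highest likelihood. The sketch below shows that generic technique only; the details of the method in the ITI 2007 paper may differ. The class name, the padding symbol, the smoothing constant and the toy training strings are assumptions made for this example.

```python
import math
from collections import Counter

PAD = "\x02"  # hypothetical padding symbol marking the start of a text


class CharMarkovModel:
    """Character n-gram model: P(char | previous `order` characters)."""

    def __init__(self, order=2, alpha=0.1):
        self.order = order
        self.alpha = alpha                # add-alpha smoothing (an assumption)
        self.ngram_counts = Counter()     # (context, char) -> count
        self.context_counts = Counter()   # context -> count
        self.alphabet = set()

    def _padded(self, text):
        return PAD * self.order + text.lower()

    def train(self, text):
        s = self._padded(text)
        for i in range(self.order, len(s)):
            context, char = s[i - self.order:i], s[i]
            self.ngram_counts[(context, char)] += 1
            self.context_counts[context] += 1
            self.alphabet.add(char)

    def log_likelihood(self, text, alphabet_size):
        s = self._padded(text)
        total = 0.0
        for i in range(self.order, len(s)):
            context, char = s[i - self.order:i], s[i]
            numerator = self.ngram_counts[(context, char)] + self.alpha
            denominator = self.context_counts[context] + self.alpha * alphabet_size
            total += math.log(numerator / denominator)
        return total


def identify(models, text):
    """Return the language whose model assigns the text the highest likelihood."""
    shared = set().union(*(m.alphabet for m in models.values()))
    alphabet_size = len(shared) + 1  # one extra slot for unseen characters
    return max(models, key=lambda lang: models[lang].log_likelihood(text, alphabet_size))


# Toy training data (hypothetical, far too small for real use).
toy_models = {"hr": CharMarkovModel(), "sr": CharMarkovModel()}
toy_models["hr"].train("tko je kupio kruh i mlijeko prošli tjedan")
toy_models["sr"].train("ko je kupio hleb i mleko prošle nedelje")
print(identify(toy_models, "mlijeko"))
print(identify(toy_models, "hleb"))
```

Because the model works on characters rather than whole tokens, it can score words it has never seen, at the cost of weaker evidence from each individual feature.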
The confusion table for the second-order Markov-chain character model, which reaches an accuracy of 90.3%, is:
|    | bs  | hr  | sr  |
| bs | 173 |  17 |  10 |
| hr |  30 | 170 |   0 |
| sr |   1 |   0 | 199 |
The confusion table for the Naive Bayes token model, which reaches an accuracy of 95.7%, is:
|    | bs  | hr  | sr  |
| bs | 181 |  11 |   8 |
| hr |   7 | 193 |   0 |
| sr |   0 |   0 | 200 |
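The reported accuracies can be recomputed from the two tables: accuracy is the sum of the diagonal divided by the sum of all cells. The short sketch below does that and also measures what share of all errors falls on the Bosnian/Croatian pair. That share is the same whether the rows hold the true or the predicted labels, since it adds both off-diagonal cells of the pair. The function and variable names are chosen for this example only.

```python
def accuracy(confusion):
    """Fraction of documents on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def pair_error_share(confusion, i, j):
    """Share of all errors that are confusions between classes i and j."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return (confusion[i][j] + confusion[j][i]) / (total - correct)


# Values copied from the two tables above; row and column order is bs, hr, sr.
markov = [[173, 17, 10], [30, 170, 0], [1, 0, 199]]
naive_bayes = [[181, 11, 8], [7, 193, 0], [0, 0, 200]]

print(f"{accuracy(markov):.1%}")       # 90.3% (542 of 600)
print(f"{accuracy(naive_bayes):.1%}")  # 95.7% (574 of 600)
print(f"{pair_error_share(markov, 0, 1):.1%}")       # 81.0% (47 of 58 errors)
print(f"{pair_error_share(naive_bayes, 0, 1):.1%}")  # 69.2% (18 of 26 errors)
```

In both models, most of the errors are confusions between Bosnian and Croatian.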
Since distinguishing between Bosnian and Croatian is clearly hard, besides the BS-HR-SR model (bs-hr-sr.classifier) I also publish a model that discriminates only between Croatian and Serbian (hr-sr.classifier). The confusion tables for the Markov-chain and the Naive Bayes approaches on this two-way task are:
Markov chain:
|    | hr  | sr  |
| hr | 200 |   0 |
| sr |   1 | 199 |
Naive Bayes:
|    | hr  | sr  |
| hr | 200 |   0 |
| sr |   0 | 200 |