bsWaC – Bosnian web corpus

bsWaC is a web corpus collected from the .ba top-level domain. Version 1.0 of the corpus contains 429 million tokens and is annotated with lemma, morphosyntactic and dependency syntax layers.

The compilation of version 1.0 of the corpus is described in the WAC-9 paper “{bs,hr,sr}WaC — Web corpora of Bosnian, Croatian and Serbian” (pdf, bib).

The corpus is distributed under the CC-BY-SA license. A full-text version of the corpus can be downloaded from

You can query the corpus via the “iframed” interface below or go to the web interface directly.