hrWaC – Croatian web corpus

hrWaC is a web corpus collected from the .hr top-level domain. The current version of the corpus (v2.0) contains 1.9 billion tokens and is annotated at the lemma, morphosyntax and dependency syntax layers.

The compilation of version 1.0 of the corpus is described in the TSD2011 paper “hrWaC and slWaC: Compiling Web Corpora for Croatian and Slovene”, while version 2.0 is described in the WAC-9 paper “{bs,hr,sr}WaC – Web corpora of Bosnian, Croatian and Serbian”.

The corpus is distributed under the CC-BY-SA license. A full-text version of the corpus can be downloaded from http://hdl.handle.net/11356/1064.

You can query the v2.1 version of the corpus via the “iframed” interface below or go to the web interface directly.