slWaC – Slovene web corpus | Natural Language Processing group

slWaC is a web corpus collected from the .si top-level domain. The current version of the corpus (v2.0) contains 1.2 billion tokens and is annotated with lemma and morphosyntactic layers.

The compilation of version 1.0 of the corpus is described in the TSD2011 paper “hrWaC and slWac: Compiling Web Corpora for Croatian and Slovene” (pdf, bib), while version 2.0 is described in the JT2014 paper “The slWaC 2.0 Web corpus of the Slovene Web” (in press).

The corpus is distributed under the CC-BY-SA license. Please contact Nikola for a full-text copy of the corpus.

You can query version 2.0 of the corpus via the embedded interface below or go to the web interface directly.