caWaC — Catalan web corpus | Natural Language Processing group

caWaC is a 780-million-token web corpus of Catalan built from the .cat top-level domain in late 2013. We are releasing the corpus (1.6 GB) in a sentence-deduplicated and scrambled format (24,745,986 sentences, 733,974,675 words) under the CC-BY-SA 3.0 license. Please contact us if you require the corpus with the document structure intact.

If you use the resource, please cite the following paper:
Nikola Ljubešić and Antonio Toral. caWaC — A web corpus of Catalan and its application to language modeling and machine translation. In: LREC 2014.