Natural Language Processing group

Corpora

Monolingual corpora

  • SETimes.HR corpus and dependency treebank of Croatian
  • hrWaC, the Croatian web corpus
  • slWaC, the Slovene web corpus
  • srWaC, the Serbian web corpus
  • bsWaC, the Bosnian web corpus
  • CroLTeC, the CROatian Learner TExt Corpus

Multilingual corpora

  • SETimes corpus, a 10-language corpus built from the setimes.com domain
  • hrenWaC corpus, a Croatian–English parallel corpus built from the hrWaC web corpus data
  • TED talks parallel corpus, with 86,348 Croatian–English sentence pairs
