SETimes – A Parallel Corpus of English and South-East European Languages

The corpus is based on the content published on the SETimes news portal. The portal publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus addresses issues present in an older version (published inside OPUS and described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve those issues:

  • stricter extraction process – no HTML residues present
  • language identification on every non-English document – non-English pages on the portal contain English text whenever the article was not translated into that language
  • resolving encoding issues in Croatian and Serbian – diacritics that had been partially lost due to encoding errors were restored (the text was rediacritized).
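
The language identification step can be illustrated with a minimal sketch. This is not the method used to build the corpus, which is not described here; it is a simple stopword-ratio heuristic, and the word list, the function names and the 0.2 threshold are all assumptions chosen for illustration. It flags a supposedly non-English document whose text is in fact mostly English.

```python
import re

# Hand-picked frequent English function words. This list is an assumption
# made for the sketch; words that also occur in the other corpus languages
# (e.g. "a", "to", "on", "are") are deliberately left out.
ENGLISH_STOPWORDS = frozenset(
    "the of and that for with was from has have this which were been their".split()
)


def english_ratio(text: str) -> float:
    """Return the share of word tokens that are common English function words."""
    # [^\W\d_]+ matches runs of Unicode letters, so diacritics are kept intact.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in ENGLISH_STOPWORDS)
    return hits / len(tokens)


def looks_english(text: str, threshold: float = 0.2) -> bool:
    """Heuristically decide whether a document is (mostly) English."""
    return english_ratio(text) >= threshold
```

A document stored under a non-English language code for which `looks_english` returns `True` would be a candidate for exclusion. A trained language identifier would be more reliable than this heuristic, especially on short texts.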

The table below lists the sentence-aligned language combinations, which are available in TMX (upper right) or TXT / Moses (lower left) format.

The corpus is published under the CC-BY-SA license.

        bg    bs    el    en    hr    mk    ro    sq    sr    tr
bg       -  136k  212k  213k  203k  207k  211k  212k  211k  206k
bs    136k     -  138k  138k  138k  133k  137k  138k  136k  134k
el    212k  138k     -  227k  205k  207k  212k  227k  224k  207k
en    213k  138k  227k     -  206k  208k  213k  228k  225k  208k
hr    203k  138k  205k  206k     -  199k  204k  205k  204k  199k
mk    207k  133k  207k  208k  199k     -  206k  207k  207k  203k
ro    211k  137k  212k  213k  204k  206k     -  212k  211k  206k
sq    212k  138k  227k  228k  205k  207k  212k     -  225k  207k
sr    211k  136k  224k  225k  204k  207k  211k  225k     -  206k
tr    206k  134k  207k  208k  199k  203k  206k  207k  206k     -
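
As a rough guide to working with the two download formats, the sketch below reads sentence pairs from a TXT / Moses pair (two plain-text files whose lines correspond one to one) and from a TMX file. The function names and the UTF-8 encoding are assumptions, not something stated in the corpus description, so adjust them to the files you actually download.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple

# ElementTree expands the predefined xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_moses(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs from two line-aligned text files."""
    # UTF-8 is assumed here. zip() stops at the shorter file, so a length
    # mismatch between the two files is silently truncated.
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip(src, tgt):
            yield src_line.rstrip("\n"), tgt_line.rstrip("\n")


def read_tmx(path: str, src_lang: str, tgt_lang: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs from the <tu> units of a TMX file."""
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                segments[lang.lower()] = "".join(seg.itertext())
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]
        elem.clear()  # release the processed unit to keep memory use low
```

For example, `list(read_moses("corpus.en", "corpus.hr"))` would return the English-Croatian pairs from two hypothetical files, and `read_tmx("corpus.tmx", "en", "hr")` would do the same for a hypothetical TMX file. Both are generators, so large files can be processed without loading them fully into memory.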