
SETimes – A Parallel Corpus of English and South-East European Languages

The corpus is based on content published on the SETimes.com news portal, which publishes “news and views from Southeast Europe” in ten languages: Bulgarian, Bosnian, Greek, English, Croatian, Macedonian, Romanian, Albanian, Serbian and Turkish. This version of the corpus addresses issues present in an older version (published in OPUS and described in the LREC 2010 paper by Francis M. Tyers and Murat Serdar Alperen). The following procedures were applied to resolve those issues:

stricter extraction process – no HTML residues present

language identification on every non-English document – when an article was not translated into a given language, the online document for that language contains English text instead

resolving encoding issues in Croatian and Serbian – diacritics were partially lost due to encoding errors, so the affected text was rediacritized

Sentence-aligned data for each language pair can be obtained from the following table, in TMX format (cells above the diagonal) or TXT / Moses format (cells below the diagonal).

The corpus is published under the CC-BY-SA license.

	bg	bs	el	en	hr	mk	ro	sq	sr	tr
bg	–	136k	212k	213k	203k	207k	211k	212k	211k	206k
bs	136k	–	138k	138k	138k	133k	137k	138k	136k	134k
el	212k	138k	–	227k	205k	207k	212k	227k	224k	207k
en	213k	138k	227k	–	206k	208k	213k	228k	225k	208k
hr	203k	138k	205k	206k	–	199k	204k	205k	204k	199k
mk	207k	133k	207k	208k	199k	–	206k	207k	207k	203k
ro	211k	137k	212k	213k	204k	206k	–	212k	211k	206k
sq	212k	138k	227k	228k	205k	207k	212k	–	225k	207k
sr	211k	136k	224k	225k	204k	207k	211k	225k	–	206k
tr	206k	134k	207k	208k	199k	203k	206k	207k	206k	–
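
As a rough illustration of how the two download formats can be read, here is a minimal Python sketch. It rests on assumptions not stated on this page: that a TXT / Moses download consists of two plain-text UTF-8 files with one sentence per line, aligned by line number, and that the TMX files mark each language with an xml:lang (or lang) attribute using the two-letter codes from the table. The function names and any file names are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import zip_longest
from typing import Iterator, Tuple

# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_moses_pairs(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs from two line-aligned text files.

    Assumption: line i of src_path is the translation of line i of tgt_path.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                # One file ended early, so the alignment cannot be trusted.
                raise ValueError(f"files differ in length at line {line_no}")
            yield s.rstrip("\n"), t.rstrip("\n")


def read_tmx_pairs(tmx_path: str, src_lang: str,
                   tgt_lang: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) segments from the <tu> units of a TMX file.

    Assumption: each <tu> holds one <tuv> per language, with a <seg> child.
    """
    for _, elem in ET.iterparse(tmx_path, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.findall("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang is not None and seg is not None:
                segs[lang.lower()] = "".join(seg.itertext())
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        elem.clear()  # release the processed unit to keep memory use low
```

Both readers are generators, so a large language pair can be streamed without loading the whole file, for example `for en, hr in read_moses_pairs("setimes.en", "setimes.hr"): ...` with whatever file names the download actually uses.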