TED talks | Natural Language Processing group

Croatian-English TED talks parallel corpus

A collection of parallel sentences extracted from the Croatian-English TED talk transcripts available from WIT³. The transcripts were paired, split into sentences with OpenNLP, and sentence-aligned with hunalign. The corpus is available under the CC-BY-NC-SA-3.0 license; please read the terms of use on WIT³ regarding TED talks licensing, as they apply here as well. The corpus consists of 86,348 sentence pairs and 2,384,887 tokens (en: 1,289,367; hr: 1,095,520). It is available for download locally in Moses format (ted_talks_en-hr_2013-01-03.tar.gz) and from OPUS (http://opus.lingfil.uu.se/TedTalks.php) in Moses and TMX formats, with various metadata for both the Croatian and English sides.
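
Moses format stores a parallel corpus as two plain-text files with one sentence per line, where line N of one file is the translation of line N of the other. The following is a minimal Python sketch of how such a pair of files might be read and counted. The function names, the file paths passed in, and the UTF-8 encoding are assumptions for illustration; the names of the files inside the archive are not stated here. Tokens are counted by splitting on whitespace, which may not reproduce the token counts reported above if the corpus was tokenized differently.

```python
from itertools import zip_longest


def read_moses_pairs(en_path, hr_path, encoding="utf-8"):
    """Yield (english, croatian) sentence pairs from two line-aligned files.

    en_path and hr_path are hypothetical paths to the extracted English and
    Croatian sides; UTF-8 is an assumed encoding.
    """
    with open(en_path, encoding=encoding) as en_file, \
            open(hr_path, encoding=encoding) as hr_file:
        pairs = zip_longest(en_file, hr_file)
        for line_no, (en, hr) in enumerate(pairs, start=1):
            # A missing line on either side means the files are misaligned.
            if en is None or hr is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield en.rstrip("\n"), hr.rstrip("\n")


def corpus_stats(pairs):
    """Count sentence pairs and whitespace-separated tokens per side."""
    n_pairs = en_tokens = hr_tokens = 0
    for en, hr in pairs:
        n_pairs += 1
        en_tokens += len(en.split())
        hr_tokens += len(hr.split())
    return {"pairs": n_pairs, "en_tokens": en_tokens, "hr_tokens": hr_tokens}
```

Calling `corpus_stats(read_moses_pairs(en_path, hr_path))` on the two extracted files streams the corpus once without loading it into memory, and the length check catches a truncated or misaligned download early.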