Twitter corpora of BCMS and Slovene | Natural Language Processing group

The corpora of BCMS (Bosnian, Croatian, Montenegrin and Serbian) and Slovene are built with the TweetCaT tool. The collection process started in June 2013 and is still under way.

Work on discriminating between the individual BCMS languages is ongoing, and specific corpora for each of those languages will follow.

In accordance with the Twitter Terms of Use, we distribute the current versions of the corpora as lists of tweet IDs, which should be used to rebuild the corpora via the Twitter API. The process of building the corpora is described in the following paper:
Nikola Ljubešić, Darja Fišer and Tomaž Erjavec: TweetCaT: a tool for building Twitter corpora of smaller languages. In: LREC 2014.
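As a rough illustration of the first step in rebuilding a corpus from tweet IDs, the sketch below reads IDs and groups them into batches for lookup requests. It is not part of TweetCaT. The one-ID-per-line input layout, the batch size of 100 and the helper names `read_tweet_ids` and `batched` are assumptions made for this example; the actual API call is left out because it depends on the API version and credentials in use.

```python
from typing import Iterable, Iterator, List

# Assumption: the tweet lookup endpoint accepts up to 100 IDs per request.
BATCH_SIZE = 100


def read_tweet_ids(lines: Iterable[str]) -> Iterator[str]:
    """Yield tweet IDs from lines, skipping blank and non-numeric lines.

    Assumes one ID per line (hypothetical layout of the distributed files).
    """
    for line in lines:
        token = line.strip()
        if token.isdigit():
            yield token


def batched(ids: Iterable[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Group IDs into lists of at most `size` items, preserving order."""
    batch: List[str] = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


if __name__ == "__main__":
    sample = ["123", "", "456", "not-an-id", "789"]
    for batch in batched(read_tweet_ids(sample), size=2):
        # Each comma-joined string could be sent as the ID list of one lookup request.
        print(",".join(batch))
```

Tweets that have been deleted or made private since collection will not be returned by the API, so a rebuilt corpus may be smaller than the figures reported for the original.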

corpus	# of users	# of tweets	# of words
hrsrTwitterCorpus (BCMS)	54,578	33,311,813	379,255,987
slTwitterCorpus (Slovene)	6,540	5,017,277	64,623,430