
About

Most parsers rely on finite-state software (HFST, FST, etc.), which is hard for novice users to install and run. These parsers also need word-per-line (WPL) input to produce useful results. This demo combines two steps into one: it converts the input text to WPL format automatically and then parses it.
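The conversion to WPL format can be illustrated with a minimal sketch. It is not the tokenizer used by this demo; the pattern `TOKEN_RE` and the function `to_wpl` are hypothetical names chosen for illustration, and the emoticon pattern covers only a few common forms.

```python
import re

# Hypothetical token pattern (an assumption, not the demo's own rules):
# emoticons are tried first, then words, then any other single symbol.
TOKEN_RE = re.compile(
    r"[:;=][-^]?[()\[\]|DPp/\\]"   # a few common emoticons such as :) :( :|
    r"|\w+(?:['’]\w+)*"            # words, keeping a suffix attached after an apostrophe
    r"|\S"                         # any remaining non-space character on its own
)


def to_wpl(text: str) -> str:
    """Return the text with one token per line (word-per-line format)."""
    return "\n".join(TOKEN_RE.findall(text))


if __name__ == "__main__":
    print(to_wpl("slm, çoook iyi :)"))
```

For the input `slm, çoook iyi :)` this sketch puts `slm`, `,`, `çoook`, `iyi` and `:)` on separate lines, which is the shape a word-per-line parser expects.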

This study is based on "An averaged perceptron-based morphological disambiguator for Turkish text", but it has been modified to parse some new classes and to produce output that is usable with CWB or similar software.

The lexicon comes from two sources.

  • The first source is the corpora built under the TS Corpus Project, in particular the TweetS Corpus. This corpus contains 13M+ tokens harvested from Twitter, so it is a good collection of "internet specific" usages. Word classes such as intabbr, intEmphasis, emoticon, intSlang and YY are mostly derived from this corpus.
  • The second source is TrMorph by Çağrı Çöltekin, which has a very rich and useful lexicon. Word classes such as abbr and tinglish come from the TrMorph lexicon.

A TweetS Corpus built using this tagset is available on TS Corpus. The new tags are:

  • intabbr (Internet Abbreviations): These are internet-specific shortened forms of words. Words in this class are missing one or more characters. (slm - selam[hi], tşk - teşekkürler[thanks], cnm - canım[dear], etc.)
  • intEmphasis (Internet Emphasis): These words are mostly interjections in which a character is repeated to emphasize the meaning. (çoook - çok[very], evett - evet[yes], acabaa - acaba[I wonder], etc.)
  • Emoticon (Emoticons & Smileys): These are symbols that represent a feeling or a facial expression, mostly built from punctuation marks. (:( - Sad Face, :) - Smile, :| - No_Expression, etc.)
  • intSlang (Internet Slang): This class is for internet-specific slang words. Some of them have censored vowels; others resemble abbreviations. (A.Q. - [--], g*t - [---], etc.)
  • YY (Misspelling): This class represents misspelled words. Some of these words are misspelled because a Turkish keyboard layout was not available, and some are misspelled on purpose. (qIsA - kısa[short], qibi - gibi[like], eylul - eylül[September], etc.)
  • HTML_entities (HTML Entities): These are HTML entities. The characters are reserved in HTML, SGML or XML, so ts_tokenizer escapes them by converting them to entities. (<, >, &)
  • abbr (Abbreviations): This class contains abbreviations from the TrMorph abbreviations lexicon and TDK. (IMF - [International_Monetary_Fund], LPG - [Liquefied_Petroleum_Gas], TBMM - Türkiye Büyük Millet Meclisi[Turkish Grand National Assembly], etc.)
  • tinglish: These words are from the TrMorph lexicon. This lexicon file contains words that are generally direct transliterations of (mostly) English words (Çöltekin, 2013). (feysbuk - [facebook], tivit - [tweet], etc.)
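
As a rough illustration of how some of the tags above could be assigned, the sketch below applies simple rules to a single token. It is a toy under stated assumptions, not the demo's actual method: the function `guess_tag`, the tiny dictionaries, and the rule order are all hypothetical, and a real system would rely on the full lexicons described above.

```python
import html
import re

# Hypothetical resources, filled only with examples from the tag list above.
EMOTICON_RE = re.compile(r"[:;=][-^]?[()\[\]|DPp/\\]")
REPEAT_RE = re.compile(r"(.)\1+")          # a character repeated two or more times
INT_ABBR = {"slm": "selam", "tşk": "teşekkürler", "cnm": "canım"}
KNOWN_WORDS = {"çok", "evet", "acaba"}     # stand-in for a real lexicon


def guess_tag(token: str):
    """Return (tag, normalized form); tag is None when no rule applies."""
    if token in ("<", ">", "&"):
        # Reserved markup characters are converted to entities.
        return "HTML_entities", html.escape(token)
    if EMOTICON_RE.fullmatch(token):
        return "Emoticon", token
    # Note: str.lower() is not Turkish-aware (I/ı, İ/i); acceptable for a sketch.
    lower = token.lower()
    if lower in INT_ABBR:
        return "intabbr", INT_ABBR[lower]
    # Collapse repeated characters and accept the result only if it is a known
    # word, so legitimate double letters are not tagged as emphasis.
    collapsed = REPEAT_RE.sub(r"\1", lower)
    if collapsed != lower and collapsed in KNOWN_WORDS:
        return "intEmphasis", collapsed
    return None, token


if __name__ == "__main__":
    for tok in ["slm", "çoook", ":|", "<", "kitap"]:
        print(tok, guess_tag(tok))
```

Checking the collapsed form against a word list is a deliberate design choice in this sketch: without it, any word containing a doubled letter would be mistaken for emphasis.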

References

  • Sak, H., Güngör, T., Saraçlar, M. (2008). Turkish Language Resources: Morphological Parser, Morphological Disambiguator and Web Corpus.
  • Çöltekin, Ç. (2010). A Freely Available Morphological Analyzer for Turkish. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), Valletta, Malta, May 2010.
  • Sezer, B., Sezer, T. (2013). TS Corpus: Herkes için Türkçe Derlem. 27. Ulusal Dilbilim Kurultayı Bildiri Kitabı, 3-4 Mayıs 2013, Antalya, Kemer: Hacettepe Üniversitesi, İngiliz Dilbilim Bölümü, 217-225.