|
About
Most parsers rely on finite-state software (HFST, FST, etc.), which is hard for novice users to install and run. These parsers also need word-per-line (WPL) input to produce useful results.
This demo combines two steps into one: it automatically converts the input text to WPL format and then parses it.
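The WPL conversion step can be illustrated with a minimal sketch. This is not the tokenizer used by the demo; the pattern `TOKEN_RE` and the function `to_wpl` are hypothetical names chosen for illustration, and the splitting rule (runs of word characters, or single punctuation marks) is an assumption.

```python
import re

# Assumed rule: a token is either a run of word characters
# or a single non-space, non-word symbol such as punctuation.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def to_wpl(text: str) -> str:
    """Return the input text with one token per line (WPL format)."""
    return "\n".join(TOKEN_RE.findall(text))


if __name__ == "__main__":
    print(to_wpl("Selam, nasılsın?"))
```

A rule this simple would split an emoticon such as :) into two tokens, so a tokenizer for internet text needs extra patterns for emoticons, abbreviations with periods, and similar cases.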
This study is based on "An averaged perceptron-based morphological disambiguator for Turkish text", modified to parse some new word classes and to produce output usable with CWB or similar software.
The lexicon comes from two sources.
The first source is the set of corpora built under the TS Corpus Project, especially the TweetS Corpus. This corpus contains 13M+ tokens harvested from Twitter, so it is a good collection of "internet-specific" usages. The word classes intabbr, intEmphasis, emoticon, intSlang and YY are mostly derived from this corpus.
The second source is TrMorph by Çağrı Çöltekin. TrMorph has a very rich and useful lexicon. The word classes abbr and tinglish come from the TrMorph lexicon.
A TweeTS Corpus built with this tagset is available on TS Corpus. The new tags are:
intabbr (Internet Abbreviations): These are internet-specific shortened forms of words. Words in this class are missing one or more characters. (slm - selam[hi], tşk - teşekkürler[thanks], cnm - canım[dear], etc.)
intEmphasis (Internet Emphasis): These words are mostly interjections in which a character is repeated to emphasize the meaning. (çoook - çok[very], evett - evet[yes], acabaa - acaba[I wonder], etc.)
Emoticon (Emoticons & Smileys): These are symbols that represent a feeling or a facial expression, mostly built from punctuation marks. (:( - Sad Face, :) - Smile, :| - No_Expression, etc.)
intSlang (Internet Slang): This class is for internet-specific slang words. Some of them have censored vowels; others resemble abbreviations. (A.Q. - [--], g*t - [---], etc.)
YY (Misspelling): This class represents misspelled words. Some of these words are misspelled because a Turkish keyboard layout was not available, and some are misspelled on purpose. (qIsA - kısa[short], qibi - gibi[like], eylul - eylül[September], etc.)
HTML_entities (HTML Entities): These are HTML entities. The underlying characters are reserved in HTML, SGML and XML, so ts_tokenizer escapes them by converting them to entities. (<, >, &)
abbr (Abbreviations): This class contains abbreviations from the TrMorph abbreviations lexicon and TDK. (IMF - [International_Monetary_Fund], LPG - [Liquefied_Petroleum_Gas], TBMM - Türkiye Büyük Millet Meclisi[Turkish Grand National Assembly], etc.)
tinglish: These words are from the TrMorph lexicon. This lexicon file contains words that are generally direct transliterations of (mostly) English words (Çöltekin, 2013). (feysbuk - [facebook], tivit - [tweet], etc.)
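As a rough illustration of how the tags above relate to tokens, the sketch below looks tokens up in a toy lexicon built only from the examples in this list, and escapes reserved characters with Python's standard `html.escape`. It is not the parser's actual method, which is a trained disambiguator; the names `TOY_LEXICON`, `tag_token` and `escape_token` are hypothetical, and whether ts_tokenizer escapes characters this way is an assumption.

```python
import html

# Toy lexicon taken from the examples in the tag list; the real lexicon is far larger.
TOY_LEXICON = {
    "slm": "intabbr", "tşk": "intabbr", "cnm": "intabbr",
    "çoook": "intEmphasis", "evett": "intEmphasis", "acabaa": "intEmphasis",
    ":(": "Emoticon", ":)": "Emoticon", ":|": "Emoticon",
    "qIsA": "YY", "qibi": "YY", "eylul": "YY",
    "IMF": "abbr", "LPG": "abbr", "TBMM": "abbr",
    "feysbuk": "tinglish", "tivit": "tinglish",
}

# Characters reserved in HTML, SGML and XML.
RESERVED = {"<", ">", "&"}


def tag_token(token: str):
    """Return the tag for a token, or None if this toy lexicon does not know it."""
    if token in RESERVED:
        return "HTML_entities"
    return TOY_LEXICON.get(token)


def escape_token(token: str) -> str:
    """Convert reserved characters to entities (assumed behaviour, via html.escape)."""
    return html.escape(token, quote=False)


if __name__ == "__main__":
    for tok in ["slm", ":)", "<", "tivit", "kitap"]:
        print(escape_token(tok), tag_token(tok), sep="\t")
```

A plain lookup like this cannot tag unseen forms such as a new character repetition or a new misspelling, which is one reason a statistical disambiguator is used instead.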
|