Part-of-speech taggers and morphological analyzers expect their input in specific formats. In most cases, the input file is a "tokenized" file, where each independent unit is written on a new line. This file format is known as "word-per-line" (wpl) or "vertical file format" (vrt). In this sense, it would not be wrong to call tokenization the very first step of NLP. Even though it looks simple, tokenization is a hard task. Mistokenized input dramatically reduces the accuracy of POS tagging and morphological analysis. For accurate tokenization, the intended token boundaries should be clearly defined. Texts in published materials pass through an editorial process, which means they are checked for spelling errors and punctuation mistakes. On the other hand, texts from internet sources such as Twitter, Facebook, or online forums contain many spelling and punctuation errors. The internet also has its own way of writing: there are internet-specific abbreviations, and smileys and emoticons are widely used.
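To illustrate the word-per-line idea, here is a minimal sketch of a regex-based tokenizer. The patterns for smileys and abbreviations are illustrative assumptions for this example, not the actual rules used by ts tokenizer:

```python
import re

# Illustrative patterns (assumptions, not ts tokenizer's actual rules):
# keep smileys and a few known abbreviations as single tokens,
# split everything else into words and individual punctuation marks.
SMILEY = r"[:;=][-^]?[)(DPpO]"
ABBREV = r"(?:e\.g\.|i\.e\.|etc\.)"
WORD = r"\w+"
PUNCT = r"[^\w\s]"

# Order matters: multi-character tokens must be tried before
# single punctuation characters.
TOKEN = re.compile(f"{SMILEY}|{ABBREV}|{WORD}|{PUNCT}")

def to_wpl(text):
    """Return the text in word-per-line (vertical) format."""
    return "\n".join(TOKEN.findall(text))

print(to_wpl("Tokenization is hard, e.g. smileys :)"))
```

Note how "e.g." and ":)" survive as single tokens, while the comma is split off as its own line; a naive whitespace split would get all three wrong.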
ts tokenizer is a script that tries to deal with these problems. It outputs the given text in word-per-line format and recognizes smileys, abbreviations, punctuation mistakes, dates, and so on.
The script also omits any XML tags, as long as they are well-formed.
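As a sketch of that tag handling, the following assumes "omit" means well-formed tags are stripped before tokenizing the remaining text (the actual ts tokenizer behavior may differ, and the token pattern here is a simplified stand-in):

```python
import re

# Simplified stand-ins for illustration only.
TAG = re.compile(r"<[^>]+>")          # matches a well-formed XML tag
TOKEN = re.compile(r"\w+|[^\w\s]")    # words or single punctuation marks

def tokenize_dropping_xml(text):
    """Word-per-line output with XML tags removed from the text."""
    # Replace each tag with a space so adjacent words do not merge.
    without_tags = TAG.sub(" ", text)
    return "\n".join(TOKEN.findall(without_tags))

print(tokenize_dropping_xml("<s>Hello, world!</s>"))
```

Here the `<s>` and `</s>` tags never appear in the output, while the text between them is tokenized normally.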