
Many effective algorithms for low-level text processing are statistical.
An implementor exploring statistical methods for text analysis soon faces an interesting problem: the quality of results varies with a) the amount of training data and b) the length of the text chunks being processed.
Let's look at the tradeoffs in some of the domains of application.
Documents are typically classified by a) the sentiment or mood expressed in the document (sentiment analysis), b) the language of the document (language classification), or c) its topic (text categorization).
As the documents to be classified grow shorter and shorter, the quality of classification goes down considerably. This is because classification algorithms usually treat words as events drawn from a multinomial distribution, and the number of words seen is important in determining which distribution the observations fit best. At the very least, the accuracy of classification becomes highly dependent on the smoothing techniques in use. You need to keep this in mind when dealing with, say, Twitter messages.
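To make the dependence on smoothing concrete, here is a minimal sketch of a multinomial classifier with add-alpha smoothing. The training sentences, the function names (`train`, `log_posteriors`), the uniform class prior and the choice of additive smoothing are all assumptions made for illustration, not a description of any particular system.

```python
import math
from collections import Counter

# Toy training data, invented for this sketch.
TRAIN = [
    ("pos", "great phone love the screen"),
    ("pos", "love this great camera"),
    ("neg", "terrible battery hate the screen"),
    ("neg", "hate this terrible phone"),
]

def train(labelled_docs):
    """Count word occurrences per class from (label, text) pairs."""
    counts = {}
    for label, text in labelled_docs:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts

def log_posteriors(counts, text, alpha=1.0):
    """Log P(class | text) for a multinomial model with add-alpha smoothing.

    Assumes a uniform class prior and ignores words outside the training
    vocabulary.
    """
    vocab = set()
    for class_counts in counts.values():
        vocab.update(class_counts)
    words = [w for w in text.lower().split() if w in vocab]
    scores = {}
    for label, class_counts in counts.items():
        denom = sum(class_counts.values()) + alpha * len(vocab)
        scores[label] = sum(
            math.log((class_counts[w] + alpha) / denom) for w in words
        )
    # Normalise in log space so the posteriors sum to one.
    top = max(scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {label: s - log_z for label, s in scores.items()}

counts = train(TRAIN)
for alpha in (1.0, 0.1):
    p_pos = math.exp(log_posteriors(counts, "love", alpha)["pos"])
    print(f"alpha={alpha}: P(pos | 'love') = {p_pos:.3f}")
```

For the one-word document "love", the posterior for the positive class moves from 0.75 with alpha = 1 to about 0.95 with alpha = 0.1 on this toy data. With only one observation, the smoothing constant largely decides how confident the classifier is.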
Language modelling captures the idiosyncrasies of correct usage and form in order to tell plausible sentences apart from implausible ones. Such models are used in tasks like machine translation, spelling correction, grammar correction and language identification.
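As a rough illustration of telling plausible sentences from implausible ones, the sketch below trains a bigram model with add-alpha smoothing on three invented sentences and compares the log-probability of a sentence with that of the same words in reverse order. The corpus, the function names and the smoothing scheme are assumptions chosen for brevity; real language models are trained on far larger corpora and use better smoothing.

```python
import math
from collections import Counter

def train_bigram_model(sentences):
    """Collect context and bigram counts, with sentence boundary markers."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    vocab = {word for pair in bigrams for word in pair}
    return contexts, bigrams, len(vocab)

def log_prob(sentence, model, alpha=1.0):
    """Log-probability of a sentence under the add-alpha smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + alpha) / (contexts[a] + alpha * vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )

# Toy corpus, invented for this sketch.
CORPUS = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat chased the dog",
]
model = train_bigram_model(CORPUS)
plausible = log_prob("the cat sat on the rug", model)
implausible = log_prob("rug the on sat cat the", model)
print(plausible > implausible)
```

Both sentences contain exactly the same words, so the difference in score comes entirely from word order, which is the kind of regularity a language model is meant to capture.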
In the task of language identification, again, it is possible to get accuracies of 98% or thereabouts when the task is performed at the sentence level. The typical method of performing language identification is to use KL divergence to associate a piece of text with the closest language model. However, come down to the word level, where you have only 5 to 10 characters to work with, and the quality of the algorithms drops considerably (accuracy becomes as low as 65%).
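The KL-divergence approach can be sketched as follows, assuming character bigram distributions as the language models. The two tiny corpora, the function names, the smoothing constant and the direction of the divergence (text against model) are assumptions made for this illustration, and the toy models are far too small for real use.

```python
import math
from collections import Counter

def ngram_counts(text, n=2):
    """Character n-gram counts, with a space padded on each side of the text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def kl_divergence(sample, model, vocab_size, alpha=0.5):
    """KL(sample || model), with the model distribution add-alpha smoothed."""
    sample_total = sum(sample.values())
    denom = sum(model.values()) + alpha * vocab_size
    divergence = 0.0
    for gram, count in sample.items():
        p = count / sample_total
        q = (model[gram] + alpha) / denom
        divergence += p * math.log(p / q)
    return divergence

def identify(text, models, n=2):
    """Return the language whose model has the lowest divergence from the text."""
    sample = ngram_counts(text, n)
    vocab = set(sample)
    for model in models.values():
        vocab.update(model)
    return min(
        models,
        key=lambda lang: kl_divergence(sample, models[lang], len(vocab)),
    )

# Toy corpora, invented for this sketch.
CORPORA = {
    "en": "the quick brown fox jumps over the lazy dog and the cat "
          "sat on the mat with the other animals",
    "de": "der schnelle braune fuchs springt auf den faulen hund und die "
          "katze sitzt auf der matte mit den anderen tieren",
}
MODELS = {lang: ngram_counts(text) for lang, text in CORPORA.items()}

print(identify("the cat and the dog", MODELS))
print(identify("der hund und die katze", MODELS))
```

A sentence contributes dozens of n-grams to the comparison, while a single word of 5 to 10 characters contributes only a handful, so the estimated distribution for a word is much noisier.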
In Sunita Sarawagi's survey paper on information extraction, it is reported that the cost of building a system for rule-based extraction is far less than the cost of building an ML-based system. The costs of corpus annotation, it turns out, are quite prohibitive, and so rule-based systems still remain the most practical method. The engine underlying IBM Watson, for instance, is SystemT, a rule-based system.
Most algorithms in popular use for sentiment analysis (like the ones used in Radian6 or OpenAmplify) appear to be rule-based. The simple approach used in these APIs makes them fairly consistent and domain independent, both essential qualities for the measurement of ad campaign performance.
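To give a feel for what a rule-based approach can look like, here is a hypothetical lexicon-plus-negation sketch. It is not how Radian6 or OpenAmplify work internally (that is not public knowledge here); the word lists, the negation rule and the function name are all invented for illustration.

```python
# Tiny hand-written lexicons, invented for this sketch.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "awful"}
NEGATORS = {"not", "never", "no"}

def rule_based_sentiment(text):
    """Label text by summing word polarities.

    Rule: a negator flips the polarity of the next sentiment word.
    """
    score = 0
    negate = False
    for token in text.lower().split():
        word = token.strip(".,!?;:")
        if word in NEGATORS:
            negate = True
            continue
        # +1 for a positive word, -1 for a negative word, 0 otherwise.
        polarity = (word in POSITIVE) - (word in NEGATIVE)
        if polarity:
            score += -polarity if negate else polarity
            negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(rule_based_sentiment("I love this phone!"))
print(rule_based_sentiment("Not good at all."))
print(rule_based_sentiment("The phone arrived today."))
```

Because the rules do not depend on a training corpus, the same text always gets the same label regardless of domain, which is the consistency property mentioned above. The price is that anything outside the lexicon is invisible to the system.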
Statistical methods are used in machine translation. Examples are the Google Translate engine and the MT systems built by Language Weaver.
Statistical methods are very effective at POS (Part-Of-Speech) tagging - the identification of verbs, nouns, etc., in a sentence. The Stanford Tagger has an accuracy in the high 90s (percent) for English. The corpora required to train a tagger or a machine translation system tend to be enormous.