
Many effective tools for low-level text processing tasks now use statistical methods.
Language modelling captures the idiosyncrasies of correct usage and form in order to tell plausible sentences apart from implausible ones. Such models are used in tasks like machine translation, spelling correction and grammar correction.
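As a rough illustration of how a language model separates plausible from implausible word sequences, here is a minimal sketch of a bigram model with add-one smoothing. The training sentences, the function names and the choice of smoothing are assumptions made for this example; real systems train on far larger corpora and use more refined smoothing.

```python
import math
from collections import Counter

def train_bigram_model(sentences):
    """Count unigrams and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        # <s> and </s> mark the sentence boundaries.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_probability(sentence, unigrams, bigrams):
    """Add-one smoothed log-probability of a sentence under the bigram model."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total

# Toy training data, invented for this example.
TRAINING = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat chased the dog",
]
unigrams, bigrams = train_bigram_model(TRAINING)
plausible = log_probability("the cat sat on the rug", unigrams, bigrams)
implausible = log_probability("rug the on sat cat the", unigrams, bigrams)
print(plausible > implausible)
```

The scrambled sentence uses the same words but none of the word pairs seen in training, so it receives the lower score.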
Classification assigns a label to a text on the basis of its surface properties. The frequencies of certain letters or words yield important clues to topic, sentiment, language style, language type, authorship, etc.
Various lexical resources can be extracted from raw text using statistical methods. For instance, it is possible to build dictionaries of domain-specific and technical terms from plain documents by analysing the frequencies of word clusters.
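One simple way to approximate this kind of term extraction is to count adjacent word pairs and keep those that recur. The sketch below assumes a tiny hand-written stop-word list, an arbitrary frequency threshold and three invented documents; it is meant only to show the idea, not a production method.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "are", "to", "and",
              "for", "on", "with", "by"}

def candidate_terms(documents, min_count=2):
    """Return adjacent word pairs (free of stop words) seen at least min_count times."""
    pair_counts = Counter()
    for document in documents:
        # Split on punctuation so that pairs do not span clause boundaries.
        for chunk in re.split(r"[.,;:!?]", document.lower()):
            words = re.findall(r"[a-z]+", chunk)
            for first, second in zip(words, words[1:]):
                if first not in STOP_WORDS and second not in STOP_WORDS:
                    pair_counts[(first, second)] += 1
    return [(" ".join(pair), count)
            for pair, count in pair_counts.most_common()
            if count >= min_count]

# Toy documents, invented for this example.
DOCUMENTS = [
    "The neural network is trained by gradient descent.",
    "Gradient descent updates the weights of the neural network.",
    "A neural network with many layers is hard to train.",
]
terms = candidate_terms(DOCUMENTS)
print(terms)
```

On these three documents the only pairs that recur are "neural network" and "gradient descent", which are the kind of technical terms one would want in a domain dictionary.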
Statistical methods are also used in machine translation, for example in the Google Translate engine and in the MT systems built by Language Weaver.
Statistical methods are very effective at part-of-speech (POS) tagging - the identification of verbs, nouns, etc., in a sentence. The Stanford Tagger achieves an accuracy in the high 90s (percent) for English.
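To give a feel for why frequency counts go a long way in tagging, here is a sketch of the simplest statistical baseline: assign each word the tag it carried most often in a tagged training set. This is not how the Stanford Tagger works; the tag names, the tiny training set and the default tag for unknown words are all assumptions made for this example.

```python
from collections import Counter, defaultdict

def train_baseline_tagger(tagged_sentences):
    """Map each word to the tag it was most often given in the training data."""
    tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word.lower()][tag] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}

def tag(words, model, default_tag="NOUN"):
    """Tag each word with its most frequent tag; unknown words get default_tag."""
    return [(word, model.get(word.lower(), default_tag)) for word in words]

# Toy hand-tagged sentences, invented for this example.
TAGGED = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train_baseline_tagger(TAGGED)
result = tag(["The", "cat", "barks"], model)
print(result)
```

Practical taggers improve on this baseline by also taking the surrounding words and tags into account.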
It would seem that statistical methods fare best at language-related tasks that do not require a knowledge of the grammar or meaning of the text, but they can also be used to improve the accuracy of automation in tasks that do.
In general, statistical methods can be applied with good effect to text analysis tasks where the unit of interest (be it characters, syllables, words, word-pairs or sentence-pairs) occurs in large quantities.
For instance, it is possible to discover the language of a segment of text very rapidly just by examining short sequences of characters in the text. It is also possible to identify topics from an examination of the frequencies of certain nouns, adjectives, verbs and adverbs (also known as content words).
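The character-sequence approach to language identification can be sketched as follows: build a profile of character trigram frequencies for each language, then pick the language whose profile overlaps most with that of the input. The reference sentences, the overlap measure and the function names are assumptions made for this example; a usable identifier would need much larger reference texts.

```python
from collections import Counter

def trigram_profile(text):
    """Relative frequencies of character trigrams in text (lower-cased, space-padded)."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}

def similarity(profile_a, profile_b):
    """Overlap of two profiles: sum of the smaller frequency of each shared trigram."""
    return sum(min(freq, profile_b[gram])
               for gram, freq in profile_a.items() if gram in profile_b)

def identify_language(text, reference_profiles):
    """Return the language whose reference profile is most similar to the text."""
    profile = trigram_profile(text)
    return max(reference_profiles,
               key=lambda lang: similarity(profile, reference_profiles[lang]))

# Tiny reference samples, invented for this example.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then "
               "the dog runs after the fox through the field",
    "french": "le renard brun saute par dessus le chien paresseux et ensuite "
              "le chien court après le renard dans le champ",
    "german": "der schnelle braune fuchs springt über den faulen hund und dann "
              "rennt der hund dem fuchs über das feld nach",
}
PROFILES = {lang: trigram_profile(sample) for lang, sample in SAMPLES.items()}
guess = identify_language("the dog and the fox", PROFILES)
print(guess)
```

Even with such small samples, frequent trigrams such as " th" and "he " in English or " le" and "le " in French are enough to separate the languages in this toy setting.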