Stochastic Discrimination: Historical Background |
|
![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
Although the original paper on stochastic discrimination was first published in 1990 (see [emk90]), the underlying ideas were first lectured on in 1988, and preliminary drafts of the paper were distributed early in 1989. Initial implementations also appeared early in 1989, and the first version of the implementation known as SDK was written in 1991. The original paper, entitled "Stochastic Discrimination", laid out the basic ideas of the general approach, including a discussion of the concept of uniformity, and the mathematics behind proving resistance of the method to overtraining. The paper focused a good deal on the question of speed of computation for the underlying algorithms, and in addition to noting the relevance of this in the application of SD to machine learning and pattern recognition, it used the fact that computation times turned out to be low-degree polynomial to discuss the potential applicability of these methods to complexity theory. The next major paper in this area, entitled "An Overtraining-Resistant Stochastic Modeling Method for Pattern Recognition" ([emk96]) appeared in the Annals of Statistics, and laid out, with complete mathematical rigor, the full theory behind SD as applied to machine learning and pattern recognition. There was some discussion of algorithmic issues and experimentation included in this paper, but the primary focus was on providing a comprehensive mathematical description of the underlying method. In [emk000], entitled "On the Algorithmic Implementation of Stochastic Discrimination", the ideas of SD are presented via a progression of successively more complex pieces of pseudo-code. The associated algorithms are applied to simple synthetic problems so that the effects of the succession of algorithmic changes are apparent. In this way, pseudo-code for a basic working implementation of SD is developed; in addition, issues associated with the creation of a general working implementation are discussed. The paper concludes with experimental results on a number of standard benchmark problems, and compares these results to results reported for other methods on these same problems, using fixed experimental protocols. The basic picture was completed with the paper [ekm001], entitled "A Mathematically Rigorous Foundation for Supervised Learning" where the primary focus was on laying a rigorous foundation for the subject, examining what it means for a problem in supervised learning to be "solvable". The definition which was developed there is extremely natural, and, in some sense, constitutes a minimal set of (noncircular) conditions without which one could not expect a given problem to have any chance of being solvable in practice. The definition is then shown to justify its billing since one can prove, using the theory of SD, that given any problem in supervised learning which is "solvable" under this definition, one can produce, in low-degree polynomial time, classification algorithms with error rates approaching the Bayes error rate. There are a number of other papers directly focused on the subject of SD. In [tkh&emk0] and [tkh&emk1], implementations of SD designed for the solution of general problems in pattern recognition were discussed. A variation, based on an initial idea of Ho, of the original discriminant function used with SD was the subject of [rb94]. And in [dc98], statistical estimates of error rates for SD when it is applied to real-world problems where strict theoretical norms are not satisfied, but rather, are only satisfied to within some value epsilon, are derived. The interest here is that it shows in mathematically rigorous fashion, the general robustness of the method of SD -- small degradations in theoretical requirements result in small deviations from the predicted outcome. This phenomenon has been observed in experimental practice. There exist combination methods which bear relationships to SD. Many of these, such as Ho's work dealing with Decision Forests (see [tkh95,tkh98]), are derivatives of the original theory of stochastic discrimination (see [emk90,emk96]), where resistance to over-training is maintained by combining uniform streams of (slightly) enriched, projectable weak classifiers. Here, for example, one uses multiple trees built in randomly chosen subspaces of the feature space. For another example of derivative work, see [rb94]. The combination method known as boosting (see [fs96]), based as it is on the notion of iteratively adjusting the distribution of training points so as to de-emphasize the ``easy'' points, also bears a certain relationship to SD. In fact, by viewing boosting in an appropriate way, one can use the mathematical theory of SD ([emk90,emk96]) to provide a theoretical base explaining empirically observed characteristics of boosting, including its performance on non-training data. We do not appeal to an argument involving VC dimension (see [vnv92]), as is done in [fs97]. It is important to stress that the theoretical base provided by SD for boosting is not in the form of a statistically-based argument that a certain algorithm, given appropriately distributed training/test data, produces error rates within certain bounds. It is something more fundamental which is taking place here. Estimates of error rates of combined classifiers are based solely on the error rates of the weak classifiers being combined. This has the advantage of completely bypassing any discussion of distributions of examples, and as such, allows for a potentially more general, and applicable, theory. |