
Resampling techniques for classifier error rate determination

So far, we have seen that the training set error rate can be highly misleading and is usually an overoptimistic estimate of performance. Inaccuracies are due to the over-fitting of a learning system to the data.

The simplest technique for 'honestly' estimating error rates, the holdout method, represents a single train-and-test experiment. However, a single random partition can be misleading for small or moderately-sized samples, and multiple train-and-test experiments can do better.
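For illustration, a minimal Python sketch of the holdout estimate follows; scikit-learn, the iris data set, and the decision-tree learner are illustrative assumptions, not part of the original text.

    # Holdout: a single random train-and-test split.
    # The data set and classifier are stand-ins for illustration only.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)                  # n available cases
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=0)           # j training, n - j testing cases

    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    holdout_error = 1.0 - clf.score(X_test, y_test)    # estimated true error rate
    print(f"holdout error estimate: {holdout_error:.3f}")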

Random subsampling methods

When multiple random train-and-test experiments are performed, a new classifier is learned from each training sample. The estimated error rate is the average of the error rates of the classifiers over the independently and randomly generated test partitions. Random subsampling can produce better error estimates than a single train-and-test partition.
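A minimal sketch of random subsampling, under the same illustrative assumptions (scikit-learn, the iris data set, a decision-tree learner): the error rate is simply averaged over B independent random train-and-test splits.

    # Random subsampling: average the test error over B random train-and-test splits.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import ShuffleSplit
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    B = 50                                             # number of random subsamples
    splitter = ShuffleSplit(n_splits=B, test_size=1/3, random_state=0)

    errors = []
    for train_idx, test_idx in splitter.split(X):
        clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))

    print(f"random subsampling error estimate: {np.mean(errors):.3f}")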

Table 5 compares the partitions of cases and the number of iterations for the holdout method vs. random subsampling. Random subsampling solves the problem of relying on a single, possibly uncharacteristic partition by averaging the results over many randomly generated train-and-test partitions. Here n stands for the total number of available cases, j represents the size of the subsample used in training (which can vary from 1 to n - 1, so that at least one case remains for testing), and B stands for the total number of subsamples.

 

                    Holdout        Random subsampling
Training cases      j              j
Testing cases       n - j          n - j
Iterations          1              B << n

Table 5: Comparison of holdout and random subsampling

Before we discuss what size partitions are necessary, we will examine some advantageous ways of partitioning the data.

Cross-validation

A special case of resampling is known as leaving-one-out. Leaving-one-out is an elegant and straightforward technique for estimating classifier error rates. Because it is computationally expensive, it has often been reserved for problems with relatively small sample sizes. For a given method and sample size n, a classifier is generated using (n - 1) cases and tested on the single remaining case. This is repeated n times, each time designing a classifier by leaving one case out. Thus, each case in the sample is used as a test case, and each time nearly all the cases are used to design a classifier. The error rate is the number of errors on the single test cases divided by n.
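A minimal sketch of leaving-one-out under the same illustrative assumptions: n classifiers are trained, each tested on the single case that was left out.

    # Leaving-one-out: train on n - 1 cases, test on the remaining case, repeat n times.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import LeaveOneOut
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    n_errors = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        n_errors += int(clf.predict(X[test_idx])[0] != y[test_idx][0])

    loo_error = n_errors / len(X)                      # errors on single test cases / n
    print(f"leave-one-out error estimate: {loo_error:.3f}")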

Evidence for the superiority of the leaving-one-out approach is well documented. The leave-one-out error rate estimator is an almost unbiased estimator of the true error rate of a classifier. This means that over many different sample sets of size n, the leaving-one-out estimate will average out to the true error rate. Because the estimator is almost unbiased, the estimate should be accurate even for moderately sized samples of about 100 cases or more.

While leaving-one-out is a preferred technique, with large sample sizes it may be computationally quite expensive. However, as the sample size grows, other traditional train-and-test methods improve their accuracy in estimating error rates.

The leaving-one-out error estimation technique is a special case of the general class of cross-validation error estimation methods. In k-fold cross-validation, the cases are randomly divided into k mutually exclusive test partitions of approximately equal size. The cases not found in each test partition are independently used for training, and the resulting classifier is tested on the corresponding test partition. The average error rate over all k partitions is the cross-validated error rate. This procedure was extensively tested with varying numbers of partitions, and 10-fold cross-validation seemed to be adequate and accurate, particularly for large samples where leaving-one-out is computationally expensive. Empirical results also support the stratification of cases in the train-and-test sets to approximate the percentage (prevalence) of each class in the overall sample.
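A minimal sketch of stratified 10-fold cross-validation under the same illustrative assumptions; stratification keeps the class prevalences in each fold close to those of the overall sample.

    # Stratified 10-fold cross-validation: 10 folds, each used once for testing.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import StratifiedKFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

    errors = []
    for train_idx, test_idx in skf.split(X, y):
        clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))

    print(f"10-fold cross-validated error estimate: {np.mean(errors):.3f}")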

Table 6 compares these error estimation techniques for a sample of n cases. The estimated error rate is the average of the error rates over the number of iterations. These techniques have been known since the 1960s, but only with the increased computational speed of modern computers have they become practical for larger samples and more complex learning systems.

 

                    Leaving-one-out     10-fold CV
Training cases      n - 1               90%
Testing cases       1                   10%
Iterations          n                   10

Table 6: Cross-validation estimators

The great advantage of cross-validation is that all the cases in the available sample are used for testing, and almost all the cases are also used for training the classifier.

Bootstrapping

The problem of finding the best estimator for small samples is particularly intriguing. It is not at all unusual to have a great shortage of samples. For example, medical studies are often initially done with few patients. Much attention has been given to the small-sample problem.

Traditionally, a small statistical sample has been considered to be around 30 or fewer cases. For many years, leaving-one-out was the recommended technique for evaluating classifier performance on small samples, and its use was confined to them, mostly because of the computational cost of applying leaving-one-out to larger samples. Because the leave-one-out estimator is virtually unbiased, it can also be applied to much larger samples, yielding accurate results.

For small samples, bootstrapping, a newer resampling method, has shown promise as an error rate estimator.

Although the leave-one-out error rate estimator (cross-validation) is an almost unbiased estimator of the true error rate of a classifier, there are difficulties with this technique. Both the bias and the variance of an error rate estimator contribute to the inaccuracy and imprecision of the error rate estimate. Recall that unbiased means that the estimator will, over the long run, average to the true error rate. While leaving-one-out is nearly unbiased, its variance is high for small samples.

The variance effect tends to dominate in small samples. Thus a low variance estimate that may even be somewhat biased has the potential of being superior to the leaving-one-out approach on small samples.

There are numerous bootstrap estimators, but the two that so far have yielded the best results for classification are known as the e0 and the .632 bootstrap. For the e0 bootstrap estimator, a training group consists of n cases sampled with replacement from a size n sample. Sampled with replacement means that the training samples are drawn from the data set and placed back after they are used, so their repeated use is allowed. For example, if we have 100 cases, then we randomly draw one from the initial 100, put a copy of that case in the training set, and return the original to the data set. We continue to draw cases for the training set until we have the same number of cases as we had in the original data set. Cases not found in the training group form the test group. The error rate on the test group is the e0 estimator. For this technique, it turns out that the average or expected fraction of non-repeated cases in the training group is .632, and the expected fraction of such cases in the test group is .368. The .632 bootstrap, .632B, is the simple linear combination of .368*trerr + .632*e0, where trerr is the training set error rate on all the cases (both training and testing cases). It should be noted that e0 is approximated by repeated 2-fold cross-validation, i.e., 50/50 splits of train-and-test cases.
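A minimal sketch of the e0 and .632 bootstrap estimators under the same illustrative assumptions; the data set, classifier, and number of iterations are arbitrary choices for the example.

    # e0 bootstrap: train on n cases drawn with replacement, test on the cases never drawn.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    n = len(X)
    rng = np.random.default_rng(0)
    B = 200                                            # number of bootstrap iterations

    e0_errors = []
    for _ in range(B):
        train_idx = rng.integers(0, n, size=n)         # n cases sampled with replacement
        test_idx = np.setdiff1d(np.arange(n), train_idx)   # cases not in the training group
        clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        e0_errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))
    e0 = np.mean(e0_errors)

    # Training-set (resubstitution) error rate on all n cases.
    trerr = 1.0 - DecisionTreeClassifier(random_state=0).fit(X, y).score(X, y)

    b632 = 0.368 * trerr + 0.632 * e0                  # the .632 bootstrap estimate (.632B)
    print(f"e0 = {e0:.3f}, .632B = {b632:.3f}")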

The estimated error rate is the average of the error rates over a number of iterations. About 200 iterations for bootstrap estimates are considered necessary to obtain a good estimate. Thus, this is computationally considerably more expensive than leaving-one-out. Table 7 summarizes the characteristics of these bootstrap estimators.

Both e0 and .632B are low-variance estimators. For moderately sized samples, e0 is clearly biased pessimistically, because on average the classifier trains on only 63.2% of the cases. However, e0 gives extremely strong results when the true error rate is high. As the sample size grows, .632B becomes overly optimistic, but it is very strong on small samples when the true error rate is relatively low.

 

                    Bootstrap
Training cases      n (j unique)
Testing cases       n - j
Iterations          200

Table 7: Bootstrap estimators

The bootstrap estimators are not always superior to leaving-one-out on small samples. However, low error rates for either the e0 bootstrap estimate or repeated 2-fold cross-validation (i.e., 50/50 train-and-test splits) are stronger indicators of good classifier performance than leaving-one-out estimates.

Recipes for the design of a good classifier

Because our goal is to build a classifier with the lowest true error rate, we have reviewed the various techniques for error estimation. For many classification techniques, the goal can also be stated as finding the best fit to the sample data without over-fitting the learning system. The evaluation of performance of any method requires an estimate of the true error rate. Some methods also need to estimate an additional parameter that measures the complexity of the fit; the exact nature of this fit metric depends on the type of representation or general model, such as decision trees.

The principles of classifier design and testing are quite general, and the error estimates are independent of a specific classification method. Based on the results and experiences reported in the literature, general guidelines can be given to extract the maximum amount of information from the samples. While there are many options for training and testing, we describe next those that have been found to be best and have been reported in the literature.




