Tuesday, September 28, 2010

Machine Learning Challenges with Imbalanced Data

For many real-world machine learning problems, we see an imbalance in the data, where one class is under-represented relative to the others. This leads to misclassification of elements between classes. The cost of misclassification is often unknown at learning time and can be far too high.

We often see this type of imbalanced classification scenario in fraud/intrusion detection, medical diagnosis/monitoring, bioinformatics, text categorization, and many other domains.

To better understand the problem, consider the “Mammography Data Set,” a collection of images acquired from a series of mammography examinations performed on a set of distinct patients. For such a data set, the natural classes are “Positive” and “Negative,” representing a “cancerous” or “healthy” patient, respectively. From experience, one would expect the number of noncancerous patients to greatly exceed the number of cancerous patients; indeed, this data set contains 10,923 “Negative” (majority class) and 260 “Positive” (minority class) samples.

Ideally, we want a classifier that provides a balanced degree of predictive accuracy for both the minority and majority classes. However, with many standard learning algorithms, we find that classifiers tend to provide a severely imbalanced degree of accuracy, with the majority class reaching close to 100% accuracy and the minority class only 0% to 10%. Suppose a classifier achieves 5% accuracy on the minority class of the mammography data set. That would mean 247 minority samples are misclassified as majority samples (i.e., 247 cancerous patients are diagnosed as noncancerous). In the medical industry, the ramifications of such an outcome can be overwhelmingly costly, far more so than classifying a noncancerous patient as cancerous.

This also suggests that the conventional evaluation practice of using a single assessment criterion, such as overall accuracy or error rate, does not provide adequate information in the case of imbalanced learning. In an extreme case, if a data set consists of 1% minority class examples and 99% majority class examples, the naive approach of classifying every example as a majority class example achieves an accuracy of 99%. Taken at face value, 99% accuracy across the entire data set appears superb; however, this figure fails to reflect the fact that none of the minority examples are identified, when in many situations those minority examples are precisely the ones of interest.
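The accuracy paradox above is easy to demonstrate with a few lines of code. The sketch below (my own illustration, not from any particular library) builds a toy data set with a 99:1 class ratio and evaluates a "naive" classifier that always predicts the majority class, reporting both overall accuracy and per-class recall:

```python
def evaluate(y_true, y_pred):
    """Return overall accuracy and per-class recall (accuracy within each class)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    recalls = {}
    for cls in set(y_true):
        cls_idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(y_pred[i] == cls for i in cls_idx)
        recalls[cls] = hits / len(cls_idx)
    return accuracy, recalls

# 990 majority ("neg") and 10 minority ("pos") examples -- a 99:1 split.
y_true = ["neg"] * 990 + ["pos"] * 10

# The naive classifier: always predict the majority class.
y_pred = ["neg"] * 1000

accuracy, recalls = evaluate(y_true, y_pred)
print(accuracy)        # 0.99 -- looks superb
print(recalls["pos"])  # 0.0  -- yet not a single minority example is found
```

Per-class recall (or related measures such as precision, F-measure, or the ROC curve) exposes the failure that overall accuracy hides.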

A data set is considered imbalanced if the class of interest (the positive or minority class) is relatively rare compared to the other classes (the negative or majority classes). As a result, the classifier can be heavily biased toward the majority class. Such data sets pose a challenging problem for data mining, since standard classification algorithms usually assume a balanced training set, and that assumption biases them toward the majority class.
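One simple way to counter this bias, sketched below with made-up probability values purely for illustration, is cost-sensitive thresholding: instead of predicting the minority class only when its estimated probability exceeds the usual 0.5, we lower the threshold to reflect the higher cost of missing a minority example.

```python
def predict(p_minority, threshold=0.5):
    """Predict the minority class when its estimated probability clears the threshold."""
    return "pos" if p_minority >= threshold else "neg"

# Hypothetical minority-class probabilities from some model for five samples;
# the last two samples are truly positive.
probs  = [0.02, 0.10, 0.25, 0.35, 0.60]
labels = ["neg", "neg", "neg", "pos", "pos"]

default  = [predict(p) for p in probs]       # standard 0.5 threshold
adjusted = [predict(p, 0.3) for p in probs]  # lowered threshold

print(default)   # catches only one of the two true positives
print(adjusted)  # catches both true positives
```

Lowering the threshold trades some false positives for fewer false negatives, which is often the right trade when, as in the mammography example, a missed positive is far costlier than a false alarm.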

A number of approaches, ranging from re-sampling the data set to directly dealing with the skewness of the data, have been developed to solve the class imbalance problem. But much of this research has focused on methods for dealing with imbalanced data without discussing exactly how or why such methods work, or what underlying issues they address.
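To make the re-sampling family of approaches concrete, here is a minimal sketch of random oversampling, one of the simplest variants: minority examples are duplicated at random until the classes are balanced (the function name and toy data are my own, for illustration only).

```python
import random

def oversample_minority(samples, labels, minority, seed=0):
    """Duplicate random minority samples until the class sizes are equal."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    minority_idx = [i for i, l in enumerate(labels) if l == minority]
    majority_count = len(labels) - len(minority_idx)
    new_samples, new_labels = list(samples), list(labels)
    while sum(l == minority for l in new_labels) < majority_count:
        i = rng.choice(minority_idx)
        new_samples.append(samples[i])
        new_labels.append(labels[i])
    return new_samples, new_labels

# A toy 3:1 imbalanced set.
X = [[0.1], [0.2], [0.3], [0.9]]
y = ["neg", "neg", "neg", "pos"]

X_bal, y_bal = oversample_minority(X, y, "pos")
print(y_bal.count("neg"), y_bal.count("pos"))  # 3 3
```

Undersampling the majority class is the mirror-image strategy, and more sophisticated methods synthesize new minority examples rather than duplicating existing ones; but as the paragraph above notes, why a given re-sampling scheme helps a given learner is a question the literature often leaves unexamined.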