
Machine learning challenges with Imbalanced Data

In many real-world machine learning problems the data is imbalanced: one class is underrepresented relative to the others. This leads to misclassification of elements between classes. The cost of misclassification is often unknown at learning time and can be very high.

We often see this kind of imbalanced classification scenario in fraud and intrusion detection, medical diagnosis and monitoring, bioinformatics, text categorization, and other domains.

To better understand the problem, consider the “Mammography Data Set,” a collection of images acquired from a series of mammography examinations performed on a set of distinct patients. For such a data set, the natural classes that arise are “Positive” or “Negative” for an image representative of a “cancerous” or “healthy” patient, respectively. From experience, one would expect the number of noncancerous patients to exceed greatly the number of cancerous patients; indeed, this data set contains 10,923 “Negative” (majority class) and 260 “Positive” (minority class) samples.

Preferably, we require a classifier that provides a balanced degree of predictive accuracy for both the minority and majority classes on the dataset. However, in many standard learning algorithms, we find that classifiers tend to provide a severely imbalanced degree of accuracy, with the majority class having close to 100% accuracy and the minority class having accuracies of 0 to 10%. Suppose a classifier achieves 5% accuracy on the minority class of the mammography dataset. Analytically, this would suggest that 247 minority samples are misclassified as majority samples (i.e., 247 cancerous patients are diagnosed as noncancerous). In the medical industry, the ramifications of such a consequence can be overwhelmingly costly, more so than classifying a noncancerous patient as cancerous.

Furthermore, this also suggests that the conventional evaluation practice of using singular assessment criteria, such as the overall accuracy or error rate, does not provide adequate information in the case of imbalanced learning. In an extreme case, if a given dataset includes 1% of minority class examples and 99% of majority class examples, a naive approach of classifying every example to be a majority class example would provide an accuracy of 99%. Taken at face value, 99% accuracy across the entire dataset appears superb; however, by the same token, this description fails to reflect the fact that none of the minority examples are identified, when in many situations, those minority examples are of much more interest.
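The arithmetic behind this argument can be checked with a short Python sketch. The class counts come from the mammography figures above; the function name `summarize` and the assumption that every "Negative" sample is classified correctly are illustrative choices, not part of the original data set description.

```python
# Class counts quoted in the text for the mammography data set.
N_NEGATIVE = 10_923  # majority class ("Negative")
N_POSITIVE = 260     # minority class ("Positive")


def summarize(true_pos, true_neg, n_pos=N_POSITIVE, n_neg=N_NEGATIVE):
    """Return overall accuracy next to per-class figures.

    true_pos / true_neg are the numbers of correctly classified
    minority / majority samples.
    """
    minority_recall = true_pos / n_pos
    majority_recall = true_neg / n_neg
    return {
        "accuracy": (true_pos + true_neg) / (n_pos + n_neg),
        "minority_recall": minority_recall,
        "majority_recall": majority_recall,
        # Mean of the per-class recalls: one common single-number
        # alternative to overall accuracy.
        "balanced_accuracy": (minority_recall + majority_recall) / 2,
        "missed_positives": n_pos - true_pos,
    }


# Scenario from the text: 5% of the minority class is detected (13 of 260).
# Assumption for illustration: every negative sample is classified correctly.
scenario = summarize(true_pos=13, true_neg=N_NEGATIVE)

# Naive baseline: predict "Negative" for every sample.
baseline = summarize(true_pos=0, true_neg=N_NEGATIVE)

for name, result in (("5% minority recall", scenario), ("all negative", baseline)):
    print(name)
    for metric, value in result.items():
        print(f"  {metric}: {value:.4f}" if isinstance(value, float) else f"  {metric}: {value}")
```

Under these assumptions both classifiers score close to 98% overall accuracy, while the per-class recall and the balanced accuracy show how few cancerous cases are actually found.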

A data set is considered imbalanced if the class of interest (the positive or minority class) is relatively rare compared to the other classes (the negative or majority classes). As a result, the classifier can be heavily biased toward the majority class. Data sets of this type pose a challenging problem for data mining, because standard classification algorithms usually assume a balanced training set, and that assumption produces a bias toward the majority class.

A number of approaches, ranging from re-sampling the data set to dealing directly with the skewness of the data, have been developed to solve the problem of class imbalance. But much of this research has focused on methods for dealing with imbalanced data, without discussing exactly how or why such methods work or what underlying issues they address.
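As one concrete illustration of re-sampling, here is a minimal sketch of random oversampling for a two-class problem: minority samples are duplicated at random until both classes have the same size. The post does not name a specific re-sampling method, so this technique, the function name `random_oversample`, and the toy data are all assumptions chosen for illustration.

```python
import random


def random_oversample(samples, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority samples until the two classes match.

    Assumes a binary problem: every label other than minority_label
    is treated as the majority class.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    minority = [s for s, y in zip(samples, labels) if y == minority_label]
    if not minority:
        raise ValueError("no samples carry the minority label")

    majority_count = len(labels) - len(minority)
    # Number of duplicates needed; range() of a negative number is empty,
    # so an already balanced data set is returned unchanged.
    deficit = majority_count - len(minority)
    extra = [rng.choice(minority) for _ in range(deficit)]

    return list(samples) + extra, list(labels) + [minority_label] * len(extra)


# Toy data (hypothetical): 95 majority samples and 5 minority samples.
toy_samples = list(range(100))
toy_labels = ["neg"] * 95 + ["pos"] * 5

balanced_samples, balanced_labels = random_oversample(toy_samples, toy_labels, "pos")
print(balanced_labels.count("neg"), balanced_labels.count("pos"))
```

Duplicating samples adds no new information, so a sketch like this is normally applied to the training split only, with evaluation done on untouched data.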
