
Sampling strategies for Imbalanced Learning

As discussed in my previous blog, imbalanced data poses serious challenges in machine learning. One approach to combat this imbalance is to alter the training set so as to create a more balanced class distribution, so that the resulting sampled data set can be used with traditional data-mining algorithms. This can be achieved through:
  1. Under-sampling, where the size of the majority class is reduced using different techniques such as removing redundant or borderline candidates.
  2. Over-sampling, where the size of the minority class is increased by adding more candidates to augment the data set.
  3. A hybrid approach, where a combination of both over-sampling of the minority class and under-sampling of the majority class is attempted.
Each of these techniques is discussed below.

Random Over Sampling

In random over-sampling, minority class instances are duplicated in the data set until a more balanced distribution is reached. As an illustration, consider a data set of 100 items containing 98 majority instances and 2 minority instances. Here the 2 minority instances are duplicated repeatedly, at random, until the data set is balanced across the classes.

This mode of data set duplication leads to the problem of over-fitting (each minority class instance is copied 49 times, which hurts generalization). To overcome this problem, different techniques have been developed that synthesize new data instances based on the attributes of existing ones.
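Before moving on to the synthetic techniques, here is a minimal sketch of random over-sampling, assuming the imbalanced-learn library is installed; the toy X, y below mirror the 98-vs-2 example above:

```python
from collections import Counter

import numpy as np
from imblearn.over_sampling import RandomOverSampler

# Toy data set: 98 majority (class 0) vs 2 minority (class 1) instances.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 98 + [1] * 2)

# Duplicate minority instances at random until the classes are balanced.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
print(Counter(y_res))  # Counter({0: 98, 1: 98})
```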

Synthetic Minority Over-sampling technique (SMOTE)

In this technique, the training set is altered by adding synthetically generated minority class instances, causing the class distribution to become more balanced. To create the new synthetic minority class instances, SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.
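A hand-rolled sketch of this convex-combination step (the helper name smote_sample is my own; the canonical implementation is imbalanced-learn's SMOTE class):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, k=5, n_synthetic=100, seed=0):
    """Generate synthetic minority instances: pick a minority point a,
    pick one of its k nearest minority neighbors b, and sample a random
    point on the line segment between a and b.
    Assumes X_min has more than k rows."""
    rng = np.random.default_rng(seed)
    # k+1 neighbors because each point's nearest neighbor is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for i in range(n_synthetic):
        a = rng.integers(len(X_min))           # random minority instance
        b = idx[a, rng.integers(1, k + 1)]     # one of its k neighbors
        gap = rng.random()                     # position along the segment
        synthetic[i] = X_min[a] + gap * (X_min[b] - X_min[a])
    return synthetic
```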

Some heavily employed variants of SMOTE include Borderline-SMOTE and Safe-level SMOTE.

Clustering Method 

This method first clusters all of the minority class instances using k-means clustering, then over-samples each of the clusters to have the same number of instances, leaving the overall data set balanced. The purpose of this method is to identify the disparate regions in the feature space where minority class instances are found and to ensure that each region is equally represented with minority class instances.
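A minimal sketch of this idea, assuming scikit-learn is available (cluster_oversample is a hypothetical helper name):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_oversample(X_min, n_clusters=3, seed=0):
    """Cluster the minority instances with k-means, then randomly
    duplicate instances within each cluster until every cluster is
    as large as the biggest one."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X_min)
    target = np.bincount(labels).max()  # size of the largest cluster
    parts = []
    for c in range(n_clusters):
        members = X_min[labels == c]
        extra = rng.choice(len(members), size=target - len(members))
        parts.append(np.vstack([members, members[extra]]))
    return np.vstack(parts)
```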

Focused Re-sampling

With this technique, only minority class instances that occur on the boundary between minority and majority class instances are over-sampled. In this way, redundant instances are reduced, and better performance can be achieved.
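Borderline-SMOTE (mentioned above) is one realization of this idea; a sketch assuming imbalanced-learn, given a feature matrix X and label vector y:

```python
from imblearn.over_sampling import BorderlineSMOTE

# Only minority instances near the class boundary are used as seeds
# for synthetic over-sampling.
X_res, y_res = BorderlineSMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
```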

Random Under Sampling

In random under-sampling, majority class instances are discarded at random until a more balanced distribution is reached. As an illustration, consider again a data set of 100 items containing 98 majority instances and 2 minority instances. Here the 98 majority instances are reduced at random until the data set is balanced across the classes.

This mode of data set reduction generally leads to loss of potentially useful information (in this example, about 98% of the majority class data is discarded). To overcome this, various techniques have been developed that preserve useful information by reducing the majority class through removal of redundant, noisy, and/or borderline instances.
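A minimal sketch of random under-sampling, again assuming imbalanced-learn and the toy X, y from the earlier sketch:

```python
from collections import Counter

from imblearn.under_sampling import RandomUnderSampler

# Discard majority instances at random until the classes are balanced.
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_res))  # Counter({0: 2, 1: 2}) for the 98-vs-2 example
```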

Tomek Links and CNN

This is a combination of Tomek Links, which removes borderline and noisy instances, and Condensed Nearest Neighbor (CNN), which removes redundant instances.
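A sketch of chaining the two, assuming imbalanced-learn (which spells the CNN class CondensedNearestNeighbour):

```python
from imblearn.under_sampling import CondensedNearestNeighbour, TomekLinks

# First drop redundant majority instances, then clean up the borderline
# and noisy instances that participate in Tomek links.
X_cnn, y_cnn = CondensedNearestNeighbour(random_state=0).fit_resample(X, y)
X_res, y_res = TomekLinks().fit_resample(X_cnn, y_cnn)
```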

Neighborhood Cleaning Rule

This is based on the Edited Nearest Neighbor (ENN) rule: a majority class instance is removed from the data set if it is misclassified by its three nearest neighbors, and when a minority class instance is misclassified by its three nearest neighbors, the majority class instances among those neighbors are removed as well.
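A sketch assuming imbalanced-learn:

```python
from imblearn.under_sampling import NeighbourhoodCleaningRule

# Remove majority instances that are misclassified by their three
# nearest neighbors, or that contribute to misclassifying a minority
# instance among its three nearest neighbors.
X_res, y_res = NeighbourhoodCleaningRule(n_neighbors=3).fit_resample(X, y)
```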

Hybrid Techniques

A hybrid approach can overcome the limitations of over-sampling and under-sampling by employing a combination of both, ensuring that neither too much information is lost nor too much over-fitting occurs.

Notable hybrid techniques include SMOTE+Tomek and SMOTE+ENN, where SMOTE is used to over-sample the minority class while Tomek Links/ENN is used to under-sample the majority class.
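Both combinations are available off the shelf in imbalanced-learn; a sketch:

```python
from imblearn.combine import SMOTEENN, SMOTETomek

# SMOTE over-samples the minority class; Tomek links / ENN then clean
# up the majority class (and any noisy synthetic points).
X_st, y_st = SMOTETomek(random_state=0).fit_resample(X, y)
X_se, y_se = SMOTEENN(random_state=0).fit_resample(X, y)
```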

Ensemble Techniques

Alongside these traditional sampling techniques, there are ensemble-based sampling techniques that promise better samples of the data to reduce imbalance. Ensemble-based sampling techniques are discussed here.

The above-mentioned strategies can help balance the training data set. But one big challenge with sampling strategies is deciding how much to sample, which obviously depends on the sampling strategy that is deployed. There are different approaches to help discover the right sampling strategy and the amount of sampling to be done, but in general this is a difficult optimization problem and may prove impractical depending on the size of the data set and the level of imbalance.
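One pragmatic approach is to treat the sampling ratio as a hyper-parameter and search over it with cross-validation; a sketch assuming imbalanced-learn's pipeline and scikit-learn, with illustrative ratio values:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Search over the minority/majority ratio produced by SMOTE; score on
# F1 rather than accuracy, since accuracy is misleading under imbalance.
search = GridSearchCV(
    pipe,
    param_grid={"smote__sampling_strategy": [0.25, 0.5, 0.75, 1.0]},
    scoring="f1",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```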
