Sankar Vema - Blog Space

Showing posts from October, 2011

Sampling strategies for Imbalanced Learning

As discussed in my previous post, imbalanced data poses serious challenges in machine learning. One approach to combating this imbalance is to alter the training set so that its class distribution becomes more balanced; the resulting sampled data set can then be used with traditional data-mining algorithms. This can be achieved through:

Under-sampling, where the size of the majority class is reduced using techniques such as removing redundant instances or removing boundary candidates, among others.

Over-sampling, where the size of the minority class is increased by adding more candidates to augment the data set.

A hybrid approach, where over-sampling of the minority class is combined with under-sampling of the majority class.

Each of these techniques is discussed below.

Random Over Sampling

In random over-sampling, minority class instances are duplicated in the data set until a more balanced distribution is reached. As an illustration, consider a data s