Balance In Data

I recently stumbled upon a Python library that helps manage imbalanced training datasets for machine learning projects. It’s called imbalanced-learn, and it follows scikit-learn’s API, which makes it quick and easy to pick up.
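To show what that scikit-learn-style interface looks like, here’s a minimal sketch using a toy dataset and the library’s random over-sampler; the synthetic data below is a placeholder, not anything from the project described later.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# Toy imbalanced dataset, just to show the scikit-learn-style interface.
X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)

# Samplers expose fit_resample(), mirroring scikit-learn's fit/transform pattern.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)

print(Counter(y))      # roughly 9:1 class ratio before resampling
print(Counter(y_res))  # classes are balanced after resampling
```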

I’m helping with research for a client through Data for Good’s Edmonton chapter, and we found that our training data was slightly imbalanced. To try to improve our classifiers, I applied random over-sampling, random under-sampling, and a combination of the two to the training data, to see whether I could do better than a baseline that simply predicts the majority class. All three methods produced a better AUC across my classifiers, but there was still room for improvement. After a bit of searching I came across imbalanced-learn and looked into its over-sampling techniques. After skimming the API I tried SMOTE, SMOTEENN, and SMOTETomek on the training subset of the client’s data, and every one of my classifiers improved a little further!
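Here’s a rough sketch of that experiment, with a synthetic dataset and a random-forest classifier standing in for the client’s data and models; none of the specifics below come from the actual project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN, SMOTETomek

# Stand-in imbalanced dataset (the client data isn't reproduced here).
X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

samplers = {
    "SMOTE": SMOTE(random_state=0),
    "SMOTEENN": SMOTEENN(random_state=0),
    "SMOTETomek": SMOTETomek(random_state=0),
}

for name, sampler in samplers.items():
    # Resample only the training subset; the test set stays untouched.
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```

Resampling only the training split matters: the test set has to keep the original class distribution for the AUC to reflect real-world performance.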

Determined to better understand how the three techniques work, I read the papers linked on the library’s website for SMOTE, SMOTEENN, and SMOTETomek. Essentially, SMOTE creates new minority-class samples by taking an existing minority sample and adding a randomly scaled fraction of the difference between it and one of its nearest minority-class neighbours. SMOTETomek (SMOTE + Tomek links) is an over-sampling and data-cleaning technique that applies SMOTE and then removes Tomek links, i.e. pairs of samples from opposite classes (in binary classification) that are each other’s nearest neighbours. SMOTEENN is another over-sampling and noise-reducing method that, again, applies SMOTE and then removes observations whose class disagrees with the majority of their three nearest neighbours (Edited Nearest Neighbours).
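To make that interpolation step concrete, here’s a tiny sketch of the core SMOTE idea in plain numpy. The smote_like_sample helper is something I made up for illustration; it is not the library’s implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_sample(X_minority, k=5, seed=None):
    """Create one synthetic point per minority sample by interpolating
    toward a randomly chosen minority-class nearest neighbour.
    Illustrative sketch of the SMOTE idea only."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    # The first neighbour of each point is the point itself, so drop column 0.
    neighbour_idx = nn.kneighbors(X_minority, return_distance=False)[:, 1:]

    synthetic = []
    for i, x in enumerate(X_minority):
        neighbour = X_minority[rng.choice(neighbour_idx[i])]
        lam = rng.random()                           # uniform in [0, 1)
        synthetic.append(x + lam * (neighbour - x))  # perturbed difference
    return np.asarray(synthetic)
```

Each synthetic point lands somewhere on the line segment between a real minority sample and one of its minority-class neighbours, which is why SMOTE fills in the minority region rather than just duplicating points.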

I understand that there are newer techniques for managing class imbalance in training data, such as ADASYN, but it was interesting to learn about SMOTE and a couple of methods derived from it.