Real-life classification problems often involve data that is not "beautifully" balanced. In such cases, one may see an accuracy score of over 90%, yet that is not necessarily a happy ending: we usually care most about the minority class, because it plays the more important role in these problems. To tackle this, we need proper evaluation metrics and resampling techniques. One popular problem involving an imbalanced dataset is Credit Card Fraud Detection, and it illustrates well what we need to handle such a dataset.
Preparing the Dataset
The dataset is obtained from Kaggle. For this dataset, it is clear that we want to focus on detecting the fraud cases. As a matter of fact, the number of fraud transactions in real life is much smaller than the number of non-fraud ones, which is reflected exactly in this dataset.
Here is a brief description of the dataset's columns:
- Class: 0 = non-fraud, 1 = fraud
- Amount: Transaction amount
- V1, V2, ..., V28: anonymized features, kept confidential. These are numerical values resulting from a PCA transformation.
- Time: The amount of seconds elapsed between each transaction and the first transaction in the dataset.
Let's import libraries first:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Let's check what our data looks like:
data = pd.read_csv('creditcard.csv')
data.head()
Next, let's check if our dataset contains any null variables:
data.isnull().values.any() # Returns False
In order to check how skewed our data is, let's check for the percentage of each class:
# Check ratio between classes
percentage_fraud = round((data['Class'].value_counts()[1] / len(data)) * 100, 2)
percentage_no_fraud = round((data['Class'].value_counts()[0] / len(data)) * 100, 2)
print('Percentage Fraud transactions: ', percentage_fraud)
print('Percentage No-fraud transactions: ', percentage_no_fraud)
Fraud transactions make up only 0.17% of the dataset, so it is heavily skewed.
Data is beautiful, so let's plot it to visualize the skewness:
plt.figure(figsize=(7,7))
sns.set(style="darkgrid")
sns.countplot(x="Class", data=data)
Setting Input Variables
Remember the anonymized predictors that were already transformed with PCA? We still have two predictors, Amount and Time, that have not been scaled yet, so we need to normalize these features as well.
from sklearn.preprocessing import StandardScaler, RobustScaler

rob_scaler = RobustScaler()
data['scaled_amount'] = rob_scaler.fit_transform(data['Amount'].values.reshape(-1,1))
data['scaled_time'] = rob_scaler.fit_transform(data['Time'].values.reshape(-1,1))

# Get rid of Time and Amount
data.drop(['Time','Amount'], axis=1, inplace=True)

# Let's look at the data again!
data.head()
There are various techniques for dealing with an imbalanced dataset. Popular strategies include modifying the classification algorithm to fit imbalanced data better, or balancing the classes of the training data before feeding it to the algorithm (data resampling techniques). The second approach is preferable as it has wider applicability.
Some popular resampling techniques include:
- Random undersampling
- Random oversampling
- SMOTE (Synthetic Minority Over-sampling Technique)
Despite the advantage of balancing classes, these techniques also have their weaknesses. The simplest implementation of over-sampling is to duplicate random records from the minority class, which can cause overfitting. Whereas in under-sampling, the simplest technique involves removing random records from the majority class, which can cause loss of information.
For the sake of simplicity, I only implement the two simpler techniques here: Random Oversampling and Random Undersampling. However, if you are interested in a more advanced technique, you can look up SMOTE, as it is more robust in most cases.
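To give a rough feel for the idea behind SMOTE (this is only an illustrative sketch, not a reference implementation; in practice a tested library such as imbalanced-learn should be used), a synthetic minority sample can be generated by interpolating between a minority point and one of its nearest minority neighbors:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_minority, n_new, k=5, random_state=0):
    """Generate n_new synthetic points by interpolating between randomly
    chosen minority points and one of their k nearest minority neighbors."""
    rng = np.random.RandomState(random_state)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    # Column 0 of the neighbor indices is the point itself, so skip it
    neighbors = nn.kneighbors(X_minority, return_distance=False)[:, 1:]
    base_idx = rng.randint(0, len(X_minority), size=n_new)
    neighbor_idx = neighbors[base_idx, rng.randint(0, k, size=n_new)]
    gap = rng.rand(n_new, 1)  # interpolation factor in [0, 1)
    return X_minority[base_idx] + gap * (X_minority[neighbor_idx] - X_minority[base_idx])

# Toy minority class: 20 points in 2-D
X_min = np.random.RandomState(42).randn(20, 2)
X_synthetic = smote_like_samples(X_min, n_new=50)
print(X_synthetic.shape)  # (50, 2)
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay inside the minority region instead of merely duplicating records.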
An important notice: the test set should be created before the resampling process, and model evaluation must be done on test data with the original distribution, not on resampled data, so that no information leaks from the training set into the test set.
The simple idea here is to compare the performance of the models on the original dataset with that on the two resampled datasets (oversampled and undersampled).
X = data.drop('Class', axis=1)
y = data['Class']

from sklearn.model_selection import train_test_split

# Whole dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
Obtaining Smaller Training Dataset
As the total number of training transactions is very large, which may strain my computer, I obtain a smaller training dataset with the same class ratio as the original training dataset. This newly created training dataset is treated as the originally skewed training data from this point on.
training_data = pd.concat([X_train, y_train], axis=1)
training_data['Class'].value_counts()
print('Percentage original fraud: ', percentage_fraud)
print('Percentage original no-fraud: ', percentage_no_fraud)

number_of_instances = 100000

# We will obtain at most 100,000 data instances with the same class ratio as the original data.
# Therefore, the new data will have 0.17% fraud and 99.83% non-fraud of 100,000:
# 170 fraud transactions and 99,830 non-fraud transactions.
number_sub_fraud = int(percentage_fraud/100 * number_of_instances)
number_sub_non_fraud = int(percentage_no_fraud/100 * number_of_instances)

sub_fraud_data = training_data[training_data['Class'] == 1].head(number_sub_fraud)
sub_non_fraud_data = training_data[training_data['Class'] == 0].head(number_sub_non_fraud)

print('Number of newly sub fraud data:', len(sub_fraud_data))
print('Number of newly sub non-fraud data:', len(sub_non_fraud_data))

sub_training_data = pd.concat([sub_fraud_data, sub_non_fraud_data], axis=0)
sub_training_data['Class'].value_counts()
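As a side note, instead of taking the first rows of each class with head(), a randomized stratified subsample can be drawn with train_test_split, which preserves the class ratio automatically. A small sketch on a toy frame (the variable names here are illustrative, not from the dataset above):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for training_data: 1,000 rows, 1% "fraud"
toy = pd.DataFrame({'V1': range(1000),
                    'Class': [1] * 10 + [0] * 990})

# Draw 100 random rows while preserving the class ratio
subsample, _ = train_test_split(toy, train_size=100,
                                stratify=toy['Class'], random_state=0)
print(subsample['Class'].value_counts())
```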
Assigning X and y for Newly Created Sub-Dataset
X_train_sub = sub_training_data.drop('Class', axis=1)
y_train_sub = sub_training_data['Class']
y_train_sub.value_counts()
Randomly Under-Sampling the Training Dataset
For simplicity, I use DataFrame.sample() to randomly sample the instances of each class:
# Fraud/non-fraud data
fraud_data = training_data[training_data['Class'] == 1]
non_fraud_data = training_data[training_data['Class'] == 0]

# Number of fraud, non-fraud transactions
number_records_fraud = len(fraud_data)
number_records_non_fraud = len(non_fraud_data)

under_sample_non_fraud = non_fraud_data.sample(number_records_fraud)
under_sample_data = pd.concat([under_sample_non_fraud, fraud_data], axis=0)

# Showing ratio
print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))
print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1])/len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))

# Assigning X, y for under-sampled data
X_train_undersample = under_sample_data.drop('Class', axis=1)
y_train_undersample = under_sample_data['Class']

# Plot countplot
plt.figure(figsize=(7,7))
sns.set(style="darkgrid")
sns.countplot(x="Class", data=under_sample_data)
Randomly Over-Sampling the Training Dataset
I do the same with the over-sampling technique.
# Fraud/non-fraud data
fraud_data = sub_training_data[sub_training_data['Class'] == 1]
non_fraud_data = sub_training_data[sub_training_data['Class'] == 0]

# Number of fraud, non-fraud transactions
number_records_fraud = len(fraud_data)
number_records_non_fraud = len(non_fraud_data)

# Sample with replacement, since we take a larger sample than the population
over_sample_fraud = fraud_data.sample(number_records_non_fraud, replace=True)
over_sample_data = pd.concat([over_sample_fraud, non_fraud_data], axis=0)

# Showing ratio
print("Percentage of normal transactions: ", len(over_sample_data[over_sample_data.Class == 0])/len(over_sample_data))
print("Percentage of fraud transactions: ", len(over_sample_data[over_sample_data.Class == 1])/len(over_sample_data))
print("Total number of transactions in resampled data: ", len(over_sample_data))

# Assigning X, y for over-sampled dataset
X_train_oversample = over_sample_data.drop('Class', axis=1)
y_train_oversample = over_sample_data['Class']

# Plot countplot
plt.figure(figsize=(7,7))
sns.set(style="darkgrid")
sns.countplot(x="Class", data=over_sample_data)
Evaluation Metrics in case of Imbalanced Dataset
This is a clear example where the usual accuracy score is no longer appropriate. Within this dataset, if we simply assign every transaction to the non-fraud class, we already achieve an accuracy of 99.83%, since the original data contains 99.83% non-fraud transactions.
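This can be checked with a trivial "predict everything as non-fraud" baseline; the 9,983 / 17 split below is a made-up sample that mirrors the dataset's 99.83% / 0.17% class ratio:

```python
import numpy as np

# 9,983 non-fraud (0) and 17 fraud (1) labels, mirroring the class ratio
y_true = np.array([0] * 9983 + [1] * 17)

# A "classifier" that always predicts non-fraud
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
print(accuracy)  # 0.9983 -- high accuracy, yet every fraud case is missed
```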
On the other hand, we are very interested in the Recall score, because that is the metric that will help us try to capture the most fraudulent transactions.
- Precision = TP/(TP+FP)
- Recall = TP/(TP+FN)
- TP: True Positives
- FP: False Positives
- FN: False Negatives
- TP: actually Fraud and predicted as Fraud
- FP: actually Normal but predicted as Fraud
- TN: actually Normal and predicted as Normal
- FN: actually Fraud but predicted as Normal
Due to the imbalance, many observations that are actually Fraud can be predicted as Normal transactions; these are False Negatives, and Recall penalizes them. Naturally, trying to increase Recall tends to come with a decrease in Precision. However, in our case, predicting that a transaction is fraudulent when it turns out not to be is not a massive problem compared to the opposite situation.
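To make the definitions concrete, precision and recall can be computed on a toy confusion matrix (the numbers below are made up purely for illustration):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels: 10 actual frauds, of which the model catches 8 (so 2 FN),
# and it additionally flags 4 normal transactions as fraud (4 FP)
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 86

cm = confusion_matrix(y_true, y_pred)  # rows = actual, columns = predicted
print(cm)
print('Precision:', precision_score(y_true, y_pred))  # 8 / (8 + 4) = 0.667
print('Recall:   ', recall_score(y_true, y_pred))     # 8 / (8 + 2) = 0.8
```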
Time for Machine Learning Models
Here, I decided to use SVM and Logistic Regression to fit the original (imbalanced), over-sampled, and under-sampled datasets, and compare the results correspondingly.
Evaluate the Models with Original Dataset
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

svm = SVC()
lr = LogisticRegression()

# SVM
svm.fit(X_train_sub, y_train_sub)

# Logistic Regression
lr.fit(X_train_sub, y_train_sub)

# Note: we should test on the original skewed test set
predictions_svm = svm.predict(X_test)
predictions_lr = lr.predict(X_test)

# Compute confusion matrices
cnf_matrix_svm = confusion_matrix(y_test, predictions_svm)
cnf_matrix_lr = confusion_matrix(y_test, predictions_lr)

# Recall = TP / (TP + FN)
recall_svm = cnf_matrix_svm[1,1]/(cnf_matrix_svm[1,0]+cnf_matrix_svm[1,1])
recall_lr = cnf_matrix_lr[1,1]/(cnf_matrix_lr[1,0]+cnf_matrix_lr[1,1])
For the undersampled and oversampled cases, I do the same with the corresponding X_train_undersample, y_train_undersample and X_train_oversample, y_train_oversample.
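The repeated fit / predict / recall steps for the three training sets can be wrapped in a small helper. This is a sketch; the helper name is mine, and the toy dataset below only stands in for the real train/test splits:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

def recall_on_test(model, X_tr, y_tr, X_te, y_te):
    """Fit on a (possibly resampled) training set, score on the untouched test set."""
    model.fit(X_tr, y_tr)
    return recall_score(y_te, model.predict(X_te))

# Toy usage with a small synthetic imbalanced dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)
print(recall_on_test(LogisticRegression(max_iter=1000), X_tr, y_tr, X_te, y_te))
```

The key point the helper encodes is that only the training arguments change across the three experiments; the test set is always the original, skewed one.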
After running all the codes above, here are the Recall scores that I obtained:
| Model | Original Data (Imbalanced) | Undersampled Data | Oversampled Data |
| --- | --- | --- | --- |
| SVM | 47.97 | 89.86 | 55.40 |
| Logistic Regression | 63.51 | 89.86 | 89.18 |
For the original skewed dataset, both models perform badly. With undersampled data, the Recall scores increase significantly for both classifiers. Notably, with oversampled data, only Logistic Regression significantly increases its Recall, whereas SVM improves only slightly.